PREFACE

This  book  is  an  outgrowth  of  the  two  parts  of  Mathematics  of  Statistics  by 
John  F.  Kenney  and  myself  (Van  Nostrand,  Part  I,  3rd  ed.,  1954;  Part  II, 
2nd  ed.,  1951).  It  seemed  advisable  to  prepare  a  new  textbook  which  would 
emphasize  the  inferential  and  decision-making  aspects  of  statistics  and  which 
would  assume  a  mathematical  background  on  the  part  of  the  reader  roughly 
intermediate  between  those  of  Parts  I  and  II. 

Part  I  has  for  several  years  been  used  as  a  textbook  for  first-year  students 
lacking  any  previous  knowledge  of  calculus  but  with  a  fairly  good  background 
of high school or junior college algebra. Part II, on the other hand, presupposes
at least two years of calculus with a corresponding mathematical
maturity, and is primarily intended for senior or even graduate students desiring
a deeper understanding of statistical principles. The present work requires
a  knowledge  of  elementary  calculus  and  is  adapted  to  the  usual  third-year 
university level in mathematics. The first chapter contains some of the elements
of set theory, but this material is being more and more widely taught
nowadays, even in quite junior courses, and is almost essential for the understanding
of the idea of probability.

Throughout  the  book  proofs  are  given  where  possible,  but  in  many  instances 
the  mathematical  details  are  relegated  to  the  Appendix.  There  also  will  be 
found brief treatments of topics which the student is unlikely to have encountered
in his regular courses: the gamma and beta functions, Stirling's
approximation,  Jacobians,  Bernoulli  numbers,  etc.  Although  matrix  algebra 
is nowadays much more prominent than formerly, and is invading even freshman
courses, we have included in the Appendix enough of the elements of
this  subject  to  permit  the  occasional  use  of  matrix  notation  in  the  body  of  the 
book.  The  brevity  and  convenience  of  this  notation  make  it  well  worth  while 
for  the  student  of  statistics,  at  least  in  the  more  advanced  parts  of  the  subject, 
to  spend  a  little  time  on  mastering  the  necessary  algebra. 

The  present  book  was  planned  as  another  joint  effort  by  Professor  Kenney 
and  myself.  However,  as  the  work  proceeded,  the  bulk  of  the  writing  was  left 
to  me,  and  Professor  Kenney  eventually  decided  that  his  name  ought  not  in 
fairness  to  be  attached  to  it.  I  am  much  indebted  to  him  for  his  generous 
action  and  for  his  continuing  advice  and  criticism  throughout  the  period  of 
preparation.  In  common  with  many  others  of  my  generation,  I  learned  the 
elements  of  statistics  from  Kenney's  Mathematics  of  Statistics,  Parts  I  and  II, 
and I value very highly the privilege of having collaborated with him in the
later  revised  editions  of  these  books. 

Since  the  concept  of  probability  is  fundamental  to  statistical  inference,  the 
first chapter is concerned mainly with the elements of the calculus of probability.
The treatment is heuristic rather than rigorous, but does attempt to
give a reasonable foundation of the idea of probability as a measure, and to
interpret probability objectively in terms of relative frequencies and subjectively
in terms of betting odds.

The second chapter contains the essential statistical techniques of summarizing
the data in a sample prior to making inferences about the population.
The  routine  computations  of  mean,  variance,  median,  etc.,  are  described  for 
the  benefit  of  those  students  without  any  previous  statistical  training.  The 
general properties of distributions, with their cumulants and cumulant generating
functions, are then discussed, and illustrated by reference to a number
of special probability distributions (binomial, Poisson, normal, gamma and
beta, chi-square, log-normal). This leads up to the relation between a sample
and  the  population  from  which  it  is  drawn  and  the  concepts  of  confidence 
intervals  and  fiducial  inference. 

In  Chapter  6  the  principles  of  testing  hypotheses  and  making  decisions  with 
assigned  risks  of  error  are  introduced.  The  method  of  maximum  likelihood 
and  the  concept  of  the  power  of  a  test  are  dealt  with.  All  this  is  crucial  in  any 
discussion  of  statistical  inference. 

After  a  treatment  of  different  sampling  procedures,  including  sequential 
methods,  the  usual  exact  statistical  tests  on  samples  from  a  normal  population 
are discussed. This is followed by the analysis of variance (which also assumes
normality as generally applied) and by a discussion of certain non-parametric
methods which can be used when it is unsafe to postulate a normal
population. 

Bivariate  (linear  regression,  correlation  and  contingency)  problems  are  next 
dealt  with,  followed  in  Chapter  12  by  non-linear  regression  and  curve-fitting. 
Finally there is a short chapter on multivariate problems and stochastic processes,
giving only the barest introduction to these extensive fields.

Sets  of  problems  are  included  at  the  end  of  each  chapter.  These  are 
arranged  in  groups  according  to  the  sections  in  the  chapter  to  which  they 
relate.  Within  each  group  the  problems  are  roughly  in  order  of  difficulty. 
An  attempt  has  been  made  to  maintain  a  balance  between  numerical  examples 
and  questions  on  pure  theory.  For  some  of  the  numerical  problems  it  is  very 
desirable  to  have  the  use  of  a  desk  computer;  everyone  who  has  much  to  do 
with  statistics  is  almost  bound  to  acquire  facility  in  the  use  of  such  computing 
machines.  Much  can  be  done  with  even  so  inexpensive  and  compact  a  device 
as  the  little  pocket  "Curta"  calculator.  Hints  are  provided  for  the  solution  of 
the more difficult problems. I am again grateful to Professor Kenney for
permission to include a number of problems taken from Parts I and II of our
joint  work  already  mentioned. 

The  tables  given  in  Appendix  B  should  suffice  for  most  statistical  tests,  but 
the  student  should  if  possible  have  access  to  more  complete  sets  of  tables, 
such  as  Pearson  and  Hartley's  Biometrika  Tables  for  Statisticians,  Vol.  I 
(Cambridge  University  Press,  1954)  or  Fisher  and  Yates'  Statistical  Tables 
for  Biological,  Agricultural  and  Medical  Research  (Oliver  and  Boyd,  5th  ed., 
1957). 

For  permission  to  reprint  tables  or  portions  of  tables,  I  am  indebted  to  the 
following: Sir Ronald A. Fisher, F.R.S., and Messrs. Oliver and Boyd, Ltd.,
Edinburgh (Tables B.3 and B.4); Professor M. S. Bartlett and the Department
of Statistics, University College, London (Table B.1); Dr. A. J. Jonckheere,
Professor E. S. Pearson, and the publishers of "Biometrika" (Tables B.10, 8.3
and 8.5); Dr. D. Auble and the Institute for Educational Research, Indiana
University (Table B.9); Dr. G. W. Snedecor and the Iowa State University
Press (Table B.5); Dr. F. J. Massey and the American Statistical Association
(Table B.6); Dr. J. E. Walsh and the American Statistical Association (Table
B.8); Dr. F. Wilcoxon and the American Cyanamid Company (Table 10.8).

A  list  of  references  is  given  at  the  end  of  each  chapter,  but  this  list  is  not 
intended  as  even  a  partial  bibliography.  It  serves  merely  to  indicate  to  the 
student  a  few  books  or  papers  in  which  he  may  find  a  fuller  treatment,  or  more 
detailed  proofs,  of  some  of  the  statements  in  the  text,  and  also  in  a  few  cases 
to  give  the  source  of  the  numerical  data  used  in  problems. 

In  so  vast  a  subject  as  modern  mathematical  statistics  a  textbook  writer  has 
to  be  selective.  It  is  highly  probable  that  some  statisticians,  looking  at  this 
book,  will  feel  that  the  emphasis  is  misplaced  here  and  there  or  that  a  better 
choice  of  topics  could  have  been  made.  I  can  only  plead  that  to  me  the  choice 
has  seemed  reasonable  for  the  type  of  student  I  had  in  mind. 

It  is  hardly  possible  for  me  to  express  adequately  my  indebtedness  to  all 
the  teachers  and  writers  from  whose  lectures  or  papers  I  have  derived  help  and 
inspiration.  I  am  grateful  also  to  the  publishers'  readers  who  examined  this 
book  in  manuscript  and  offered  valuable  suggestions  for  improvement.  In 
conclusion,  I  should  like  to  express  my  appreciation  of  the  help  of  Mrs.  I.  Maj, 
who  coped  most  efficiently  with  the  job  of  typing  a  manuscript  plentifully 
sprinkled  with  mathematical  symbols. 

E.  S.  K. 


FOREWORD 

This  book  contains  sufficient  material  for  a  two-semester  or 
full-year  course,  with  three  lecture  periods  per  week.  Some 
less  important  sections,  which  might  be  omitted  on  a  first 
reading,  are  starred. 

For  a  one-semester  course,  it  would  be  advisable  to  read 
most  of  Chapters  1  to  6  (omitting  the  starred  sections)  and 
also the first parts of Chapters 8 and 11. The instructor will
naturally  have  the  responsibility  of  deciding  on  the  material 
which  he  considers  most  relevant  to  the  needs  of  his  particular 
students. 

References  to  numbered  equations  within  the  same  section 
are  to  the  last  part  of  the  number  only.  Thus  Eq.  (1.10.6) 
if  referred  to  within  §1.10,  would  be  quoted  as  Eq.  (6).  If 
referred  to  in  any  later  section,  the  complete  number  is 
quoted. 

Numbers  enclosed  in  square  brackets,  such  as  [2],  refer  to 
the  literature  references  given  at  the  end  of  the  chapter. 
Chapter 1
PROBABILITY 


1.1  Uncertainty  of  Statistical  Inference  In  everyday  life  we  are  again  and 
again  faced  with  the  necessity  of  making  decisions.  Many  of  these  are  trivial, 
some  may  be  serious,  but  almost  always  there  is  an  element  of  uncertainty  about 
the  wisdom  of  the  decision  we  make.  Scientists  in  their  regular  work  have  a 
similar problem. They have to draw conclusions from their enquiries or experimental
results, but their observations are liable to error and may from their very
nature  be  subject  to  considerable  irregular  fluctuations.  Any  conclusions  that 
the  scientists  draw  will  therefore  not  be  rigid  and  unalterable,  but  will  merely  be 
more  or  less  probable.  The  theory  of  probability  is  the  groundwork  of  scientific 
inference,  and  as  such  is  the  subject  of  this  chapter. 

Statistics is concerned with variables that fluctuate in a more or less unpredictable
way, such as the monthly total of highway fatalities in the state of New
York  or  the  yearly  average  yield  of  wheat  in  bushels  per  acre  on  a  Saskatchewan 
farm.  There  may  be  assignable  causes,  with  predictable  effects  on  the  total  of  fatal 
accidents  in  a  particular  month,  but  no  one  would  expect  to  be  able,  month  after 
month,  to  predict  the  total  exactly.  The  essence  of  a  statistical  variable  is  that, 
to  some  extent  at  least,  it  is  unpredictable.  We  call  this  characteristic  randomness 
and  will  later  give  it  a  more  precise  definition. 

Since  in  almost  all  experimental  work,  and  particularly  in  the  biological  and 
social  sciences,  the  results  are  influenced  by  a  variety  of  conditions  largely 
beyond  the  experimenter's  control,  there  is  always  this  element  of  randomness 
about  the  results  of  experiment.  Variations  in  rainfall  and  soil  composition  affect 
plant  yields;  individual  peculiarities  affect  the  behavior  of  rats  or  guinea  pigs. 
Even  in  the  "exact"  sciences,  such  as  physics  and  astronomy,  with  their  relatively 
high precision of measurement, there is still a residuum of unavoidable experimental
error. It is the task of statistical inference to draw valid conclusions
about  the  world  around  us  from  such  limited  and  imperfect  observations  as  we 
can  make.  Since  these  conclusions  are  not  certain,  we  would  like  to  attach  to 
them  an  estimate  of  the  probability  of  their  truth.  How  this  can  be  done  in 
certain  types  of  problems  will  be  told  in  later  chapters. 

Probability  has  in  modern  atomic  physics  an  even  deeper  significance.  The 
"principle  of  indeterminacy,"  formulated  originally  by  W.  Heisenberg,  lays  it 
down  as  a  cardinal  truth  of  physics  that  certain  pairs  of  variables,  such  as 
position  and  momentum,  or  energy  and  time,  cannot  both  be  measured,  even  in 
principle,  with  unlimited  precision,  but  that  the  more  accurately  one  of  the  pair 
is  known,  the  less  accurately  can  the  other  be  determined.  This  has  nothing  to  do 
with the limitations and errors of actual physical apparatus; it is a theoretical
limitation  that  would  hold  with  perfect  apparatus.  A  consequence  of  this 
principle  is  that  the  variable  representing  a  particle  in  modern  quantum  theory 
is  something  that,  as  far  as  it  can  be  represented  physically  at  all,  is  a  probability 
— the  probability,  roughly  speaking,  that  the  particle  is  at  a  certain  position  in 
space  at  a  certain  time. 

1.2  Intuitive  Idea  of  Probability  Everyone  has  a  general  idea  of  what  is 
meant  by  the  words  "probable"  and  "probability."  We  hear  the  radio  announcer 
at  breakfast-time  say  that  it  will  probably  rain  before  the  afternoon,  and  we 
decide  to  wear  a  raincoat  to  the  office.  It  is,  however,  by  no  means  a  simple 
matter  to  give  a  definition  of  probability  that  will  adequately  cover  all  cases  and 
serve  as  a  satisfactory  foundation  for  statistical  inference. 

Some  people  feel  that  probability  refers  only  to  a  state  of  mind,  the  strength 
of  one's  belief  in  a  proposition.  In  order  to  make  probability  more  than  merely 
subjective,  these  writers  have  to  speak  of  the  degree  of  "rational"  belief  in  a 
proposition,  something  that  should  perhaps  be  called  credibility  rather  than 
probability.  By  making  some  rather  arbitrary  assumptions  it  is  possible  to 
arrive  at  a  numerical  calculus  of  probabilities,  but  we  shall  not  for  the  present 
pursue this line of thought any further. (See references [1] and [2] at the end of the
chapter, and also §1.8.)

The mathematical treatment of probability arose historically out of discussions
of games of chance in the 17th century (see [3], [4], [5]). If a die seems to
be  honest  and  well  made,  it  is  reasonable  to  suppose  that  it  is  equally  likely  to 
fall,  if  rolled  in  the  customary  way,  with  any  of  its  six  faces  uppermost.  This 
judgment  merely  involves  a  recognition  of  the  fundamental  symmetry  of  the  die. 
It  is  not  perfectly  symmetrical,  of  course,  since  the  faces  are  marked  differently, 
but  we  judge  that  this  minimal  lack  of  symmetry  will  not  appreciably  affect  the 
chances.  Similarly  if  five  cards  are  dealt  from  a  well-shuffled  deck,  we  feel  that 
(unless  the  dealer  is  crooked)  any  specified  set  of  five  cards  is  about  as  likely  as 
any  other.  In  these  cases  it  is  a  comparatively  simple  matter  to  calculate  the 
probabilities  of  events  that  may  be  of  interest — the  probability  for  instance  that 
a  6  turns  up  on  a  die,  or  that  the  five  cards  dealt  from  a  deck  of  52  are  all  of  one 
suit.  Trouble  arises  when  we  cannot  assume  the  fundamental  symmetry — how 
can we assess the probability of 6 with a loaded die?
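
The two symmetric cases just mentioned can be worked out by direct counting. The following is a minimal Python sketch, assuming that every face of the die and every five-card hand is equally likely; the variable names are illustrative only.

```python
from fractions import Fraction
from math import comb

# A symmetrical die: six equally likely faces, one of which is the 6.
p_six = Fraction(1, 6)

# Five cards from a well-shuffled deck of 52: every hand equally likely.
total_hands = comb(52, 5)          # number of distinct five-card hands
one_suit_hands = 4 * comb(13, 5)   # choose the suit, then 5 of its 13 cards

p_one_suit = Fraction(one_suit_hands, total_hands)

print(p_six)        # 1/6
print(p_one_suit)   # 33/16660
```

The second probability is a little under 0.002, so a hand of one suit is to be expected about once in 500 deals. No such counting argument is available for the loaded die.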

There  are  many  writers  who  feel  that  the  only  kind  of  probability  definition 
that  makes  much  sense,  particularly  in  statistics,  is  based  on  the  idea  of  the 
relative  frequency  with  which  events  of  interest  happen  in  a  long  series  of  similar 
trials.  To  assess  the  probability  of  6  with  a  die  (loaded  or  not)  we  roll  it  a  large 
number  of  times  and  count  the  number  of  times  6  turns  up.  The  ratio  of  this 
number  to  the  total  number  of  rolls  is  the  relative  frequency  of  6  and  is  an 
approximation  to  the  true  probability  of  6.  Unfortunately  we  cannot  simply  say 
that  the  probability  is  the  limiting  value  of  this  ratio  as  the  number,  n,  of  trials 
increases,  since  the  ratio  does  not  tend  to  a  limit  in  the  strict  mathematical  sense. 
What we can say is that, if some conditions are satisfied, then in the long run it
becomes almost certain that the difference between the observed relative frequency
and a fixed limit will be less than any number we like to name. The
smaller  this  number,  of  course,  the  greater  the  number  of  trials  we  shall  have  to 
make  to  reach  a  state  of  "almost  certainty."  The  conditions  to  be  satisfied  are, 
first,  that  the  successive  trials  are  independent  (which  means  that  the  result  of 
any  trial  is  not  influenced  by  what  has  happened  on  previous  trials)  and,  second, 
that  the  context  of  the  trials  has  remained  essentially  unaltered.  This  means  that 
all  the  circumstances  surrounding  the  trials  are  either  unchanged  during  the 
whole  set  of  observations  or,  if  they  are  changed,  have  no  appreciable  effect  on  the 
trials.  The  decision  as  to  what  circumstances  are  or  are  not  relevant  is  one 
which  is  often  needed  in  experimental  work,  and  must  be  made  on  the  basis  of 
experience. 

The  above  statement  is  a  special  form  of  the  "law  of  large  numbers,"  which 
will  be  stated  more  precisely  later  on.  As  the  number  of  trials  increases,  the 
relative  frequency  of  the  event  in  question  converges  stochastically  (or,  as  it  is 
sometimes  expressed,  "converges  in  probability")  to  a  limiting  value,  which  is 
defined  as  the  probability  of  the  event.  For  a  detailed,  semipopular  discussion 
of  this  concept,  see  [6]. 
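
The behavior of the relative frequency can be illustrated by simulation. The sketch below assumes that Python's pseudo-random generator is an adequate stand-in for independent rolls under unchanged conditions; the function name, the seed, and the particular loading of the die are hypothetical choices made only for illustration.

```python
import random

FACES = (1, 2, 3, 4, 5, 6)

def relative_frequency(face, n_rolls, weights=None, seed=0):
    """Roll a simulated die n_rolls times and return the relative frequency of `face`.

    weights=None gives a symmetrical die; otherwise the faces are drawn
    with chances proportional to the six weights supplied (a loaded die).
    """
    rng = random.Random(seed)  # fixed seed, so that a run can be repeated
    rolls = rng.choices(FACES, weights=weights, k=n_rolls)
    return rolls.count(face) / n_rolls

# Relative frequency of 6 for increasing numbers of trials.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(6, n))

# A die loaded so that 6 is five times as likely as any other face.
print(relative_frequency(6, 100_000, weights=(1, 1, 1, 1, 1, 5)))
```

For the symmetrical die the printed ratios should settle near 1/6, and for the loaded die near 1/2, although any single run may depart from these values by a small amount.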

1.3  Events  We  shall  take  the  point  of  view  that  probability  relates  to 
events,  which  are  phenomena  that  may  be  observed  either  to  happen  or  not  to 
happen  in  a  particular  context.  Thus  if  two  dice  are  rolled  repeatedly,  one  event 
in  which  we  may  be  interested  is  a  total  of  seven  spots  turning  up.  This  event  has 
a  definite  probability  within  the  context  of  rolls  of  these  particular  dice.  The 
context  includes  not  only  the  rolls  that  have  actually  occurred,  but  all  those  that 
might  conceivably  occur  if  we  had  unlimited  time  and  patience  to  go  on  rolling 
the  dice.  Another  type  of  event  is  a  measurement,  for  example,  of  an  intelligence 
quotient  for  a  10-year-old  Negro  boy.  The  observed  value,  say  114,  to  the 
nearest whole number, is one that has a probability significance within the context
of all such measurements on boys in the United States or only those on
10-year-old  boys,  or  perhaps  only  those  on  all  10-year-old  Negro  boys.  The 
particular  context  depends  on  what  probability  we  are  interested  in. 

Events  may  be  classified  as  simple  or  compound.  A  compound  event  is  one 
that  can  be  decomposed  into  a  set  of  simple  events,  whereas  a  simple  event 
cannot  be  decomposed  any  further.  The  occurrence  of  6  in  a  throw  of  a  die  is  a 
simple  event.  The  occurrence  of  7  with  two  dice  is  a  compound  event,  because  it 
can  be  split  up  into  six  simple  events,  each  of  which  corresponds  to  the  same 
compound  event,  namely,  6  and  1,  5  and  2,  4  and  3,  3  and  4,  2  and  5,  1  and  6. 
Here  the  first  number  in  each  pair  represents  the  number  of  spots  shown  by  the 
first  die  and  the  second  number  that  shown  by  the  second  die. 
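
The decomposition of a compound event into simple events can be carried out mechanically. A minimal sketch in Python, with each simple event written as a pair (first die, second die); the variable names are illustrative.

```python
from itertools import product

# All simple events for two dice: (spots on first die, spots on second die).
simple_events = list(product(range(1, 7), repeat=2))

# The compound event "the total is 7" is the set of simple events with that sum.
total_is_seven = [pair for pair in simple_events if sum(pair) == 7]

print(len(simple_events))   # 36
print(total_is_seven)       # [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]
```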

A  simple  event  may  be  represented  by  a  point  in  a  suitable  "space."  The 
space  corresponding  to  the  number  of  spots  on  the  upper  face  of  a  die  consists  of 
six  isolated  points,  numbered  1,  2,  3,  4,  5,  6,  on  the  axis  of  real  numbers.  Any 
observed simple event, such as a 6, is represented by one of these points. The
compound event, "a throw of at least 4," is represented by the three points
labelled 4, 5, and 6. The space representing all the possibilities in a particular
situation is often denoted by $\mathcal{U}$, meaning the universal set of all possibilities.

The space, $\mathcal{U}$, corresponding to the possible outcomes with two dice, is a
square lattice of 36 points, as in Figure 1. The x-coordinate represents the first
die and the y-coordinate the second die. The compound event, "the total of spots
shown is 7," is represented by the set of six points which are ringed in the figure.
[Fig. 1. Simple and compound events, with two dice]

[Fig. 2. Compound event in continuous space (axis X: husband's age, scale to 100)]


An insurance company may be interested in the age distribution of married
couples. If x years represents the husband's age and y years the wife's age, the
space $\mathcal{U}$ for the event, "the husband is x years old and the wife is y years old,"
is a region of the x-y plane between limits (say 15 and 100) for both variables.
The compound event, "the husband is older than the wife," will be represented
by the shaded area in Figure 2, below the line x = y.


1.4  Elements of Set Theory Applied to Events  We shall denote events by
letters, A, B, C, ..., or sometimes by $A_1, A_2, A_3, \ldots$. These events will be
represented in diagrams by regions (or points) of the appropriate space. For
convenience the diagrams will be drawn as though the space were continuous,
although it may often consist in reality of a set of discrete points (as in the
example of the two dice, Figure 1).

The  basic  idea  of  using  diagrams  like  those  in  Figures  3  and  4  seems  to  be  due 
to the 18th-century Swiss mathematician Euler. Refinements were made by the
British  logician  John  Venn  (1834-1883),  and  such  diagrams  are  now  usually 
called  Venn  diagrams. 

The contrary event to A (i.e., the event "A does not happen") will be represented
by $\tilde{A}$, which may be read as "A-tilde," or "not-A." The events A and $\tilde{A}$
together make up the whole of the appropriate space $\mathcal{U}$. Hence $\tilde{A}$ is often called
the complement of A (see Figure 3a).


If A and B are events within the same space $\mathcal{U}$, A is said to imply B (or,
symbolically, $A \subset B$) if whenever A occurs B necessarily occurs. The region
corresponding to A is nowhere outside the region corresponding to B (see Figure
3b). For example, the event, "the number of spots shown by a die is 4," implies
the event, "the number of spots shown is even." If $A \subset B$ and $B \subset A$, the events
A and B are equivalent (symbolically, A = B).

The event, "both A and B" (i.e., the simultaneous occurrence of both
events), is called the intersection of A and B and is represented by $A \cap B$.
In a Venn diagram it is represented by the intersection of the areas corresponding
to A and B (Figure 3c).

The event, "A and/or B" (i.e., the occurrence of at least one of the events
A and B), is called the union of A and B, and is denoted by $A \cup B$. It is represented
by the whole area that is included in either the A area or the B area
(Figure 3d). If the events A and B are such that they cannot both happen at
the same time, they are said to be disjoint, or mutually exclusive. In this case
$A \cup B$ is represented by the sum of the areas of A and B (Figure 3e), and the
union is then often denoted by A + B. The symbol "+" here denotes a logical
sum and means "either ... or". It is not the plus sign of arithmetic.

[Fig. 3. Venn diagrams]


The above notation is readily extended to a finite number of events. The
intersection of the events $A_1, A_2, \ldots, A_n$ is denoted by

$$A_1 \cap A_2 \cap A_3 \cap \cdots \cap A_n \quad \text{or} \quad \bigcap_{k=1}^{n} A_k,$$

while their union is denoted by $A_1 \cup A_2 \cup \cdots \cup A_n$, or $\bigcup_{k=1}^{n} A_k$. If the events
are disjoint, the union is often denoted by $\sum_{k=1}^{n} A_k$.

The event $A + \tilde{A}$, or $\mathcal{U}$, may be interpreted as a sure event (one that is
bound to happen). It is represented by the whole of the appropriate space. The
event $A \cap \tilde{A}$, or $\emptyset$, may be interpreted as an impossible event. The set of points
representing $\emptyset$ is said to be null, or empty.


1.5  Some  Theorems  on  Union  and  Intersection  of  Events    The  following 
theorems  may  readily  be  proved  from  the  definitions  of  the  preceding  section. 
They are most easily appreciated by reference to the corresponding Venn
diagrams  (Figures  3  and  4). 

Theorem 1.1  If $A \subset B$, then $\tilde{B} \subset \tilde{A}$.

Theorem 1.2  $A \cap B = B \cap A$ and $A \cup B = B \cup A$. In algebraic language,
both operations are commutative.

Theorem 1.3  $(A \cap B) \cap C = A \cap (B \cap C) = A \cap B \cap C$,
and $(A \cup B) \cup C = A \cup (B \cup C) = A \cup B \cup C$.
Both operations are associative.

Theorem 1.4  $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$.
$A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$.

See Figure 4, a and b. The two operations are distributive with respect to one
another.


[Fig. 4. Venn diagrams for three events: (a) $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$; (b) $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$]

Theorem 1.5  $\widetilde{(A \cap B)} = \tilde{A} \cup \tilde{B}$.
$\widetilde{(A \cup B)} = \tilde{A} \cap \tilde{B}$.

It  will  be  observed  that  in  Theorems  1.2  to  1.5  there  is  a  perfect  duality 
between  union  and  intersection.  If  in  any  theorem  the  symbols  of  union  and 
intersection  are  interchanged,  another  true  theorem  results. 
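
For a small discrete space these theorems can also be checked by brute force. The sketch below is an illustration rather than a proof: it takes the six faces of a die as the universal set and tests Theorems 1.4 and 1.5 for every possible choice of events. The names `U`, `events`, and `complement` are introduced here for illustration only.

```python
from itertools import chain, combinations, product

U = frozenset(range(1, 7))   # universal set: the six faces of a die

def complement(event):
    """The contrary event: all points of U not in `event`."""
    return U - event

# Every event is a subset of U; there are 2**6 = 64 of them.
events = [frozenset(c) for c in chain.from_iterable(
    combinations(sorted(U), r) for r in range(len(U) + 1))]

# Theorem 1.4: each operation is distributive with respect to the other.
distributive_ok = all(
    a & (b | c) == (a & b) | (a & c) and a | (b & c) == (a | b) & (a | c)
    for a, b, c in product(events, repeat=3))

# Theorem 1.5: complement of an intersection or a union.
theorem_1_5_ok = all(
    complement(a & b) == complement(a) | complement(b)
    and complement(a | b) == complement(a) & complement(b)
    for a, b in product(events, repeat=2))

print(len(events), distributive_ok, theorem_1_5_ok)   # 64 True True
```

Here `&` and `|` are Python's set intersection and union, so each line of the check mirrors the corresponding identity.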

For  further  elementary  discussion  of  sets,  see  [7]. 

1.6  Probability  as  a  Measure  An  event,  as  we  have  seen,  corresponds  to  a 
subset of the universal set $\mathcal{U}$ of all possibilities in the particular situation.
Suppose we can in some reasonable way assign a weight (a non-negative number)
to each point or element of area of $\mathcal{U}$, so that the total of these weights is 1.
Then  the  weight  assigned  to  an  event  A  will  be  the  sum  of  the  weights  of  all  the 
points  (or  elements  of  area)  which  make  up  A.  This  is  called  the  measure  of  A. 
We  then  define  the  probability  of  A  as  equal  to  its  measure. 

It  may  be  noted  that  the  concept  of  measure  has  a  much  wider  meaning  than 
this  in  modern  mathematics.  Probability  measure  is  only  one  among  many 
types  of  measure. 
The assignment of weights in an actual problem will depend upon the
information available and on an analysis of all the possibilities. Thus if we have
no reason to doubt the accuracy of a given die and the honesty with which it is
rolled, it seems reasonable to assign the same weight to each of the six logically
possible outcomes. In other words we take the weights as each equal to 1/6, since
they must add up to 1. The measure (and therefore the probability) of the event,
"the number of spots is even," is 1/2, since this event includes the three simple
events, 2, 4 and 6.

Suppose that there are three horses (A, B, C) in a race, and that, on form,
I judge that A is twice as likely to win as either B or C, but that the chances of
B and C are about equal. I would then assign weights 1/2, 1/4, 1/4 to A, B, C,
respectively. The probability of the event, "either A or B wins," would
be 1/2 + 1/4 = 3/4. The probability of the event, "either B or C wins," would be
1/4 + 1/4 = 1/2.
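
The definition of probability as a measure translates directly into a short computation: store the weight of each simple event and add up the weights of those making up the event of interest. A minimal sketch, in which the function name `probability` and the dictionaries `die` and `horses` are illustrative; the horse weights follow from the stated judgment that A is twice as likely to win as B or C.

```python
from fractions import Fraction

def probability(event, weights):
    """Measure of an event: the sum of the weights of its simple events."""
    return sum((weights[outcome] for outcome in event), Fraction(0))

# Equal weights for the six faces of an honest die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Weights for the three horses, with A twice as likely to win as B or C.
horses = {"A": Fraction(1, 2), "B": Fraction(1, 4), "C": Fraction(1, 4)}

# In each case the weights must total 1.
assert sum(die.values()) == 1 and sum(horses.values()) == 1

print(probability({2, 4, 6}, die))       # 1/2
print(probability({"A", "B"}, horses))   # 3/4
print(probability({"B", "C"}, horses))   # 1/2
```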

The  techniques  of  calculating  probabilities,  once  the  weights  have  been 
assigned,  will  occupy  the  major  part  of  this  chapter. 

1.7  Properties  of  a  Probability  Measure  The  basic  properties  of  the 
probability  measure  P(A)  of  an  event  A  are 

(1.7.1)  $P(A) = 1$ if $A = \mathcal{U}$.
This means that some event in $\mathcal{U}$ is bound to happen.

(1.7.2)  $0 \le P(A) \le 1$ for every A in $\mathcal{U}$.

(1.7.3)  $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

The first two follow immediately from the definition. The third may be
appreciated by reference to Figure 3, c and d. In reckoning P(A) + P(B), the
weight of every element of the intersection is counted twice. The sum is therefore
greater than $P(A \cup B)$ by the measure of this intersection.

If  A  and  B  are  disjoint, 

(1.7.4)  P(A  +  B)  =  P(A)  +  P(B). 

This  is  called  the  addition  law  for  probabilities. 

Since A and $\tilde{A}$ are disjoint, and $A + \tilde{A} = \mathcal{U}$, it follows from equations (1)
and (4) that

(1.7.5)  $P(A) + P(\tilde{A}) = 1$.

This rule is often useful when it happens to be easier to calculate $P(\tilde{A})$ than
$P(A)$. We can find $P(A)$ by calculating $1 - P(\tilde{A})$.
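
As an illustration of equations (1.7.3) and (1.7.5), consider the event "at least one 6 turns up when two dice are rolled." This example is not taken from the text above; the sketch assumes equal weights for the 36 points of the two-dice space, and the names `space` and `P` are illustrative.

```python
from fractions import Fraction
from itertools import product

space = set(product(range(1, 7), repeat=2))   # 36 equally weighted points

def P(event):
    """Probability of an event under equal weights on the 36 points."""
    return Fraction(len(event), len(space))

A = {pt for pt in space if pt[0] == 6}   # the first die shows 6
B = {pt for pt in space if pt[1] == 6}   # the second die shows 6

# Eq. (1.7.3): the intersection (6, 6) must not be counted twice.
direct = P(A | B)
by_1_7_3 = P(A) + P(B) - P(A & B)

# Eq. (1.7.5): the contrary event "no 6 at all" is easier to count.
no_six = space - (A | B)
by_complement = 1 - P(no_six)

print(direct, by_1_7_3, by_complement)   # 11/36 11/36 11/36
```

The contrary event contains 5 × 5 = 25 points, so the complement route gives 1 − 25/36 with no need to consider the overlap of A and B.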

Since the impossible event $\emptyset$ is the complement of the sure event, we see that

(1.7.6)  $P(\emptyset) = 1 - P(\mathcal{U}) = 0$.
The probability of an impossible event is zero.
[Fig. 5. Union of several events]


It should be understood that because $P(A) = 0$ it does not necessarily follow
that A is impossible. If the set $\mathcal{U}$ is represented by the points inside a square of
unit side, it would be reasonable in some contexts to make the measure of any
sub-region of the square equal to the area of that region. The measure of any set of
isolated points or finite line segments would then be zero, although we could not say
that these points or lines correspond to impossible events. If a property holds within
a given region, except for a set of points of measure zero, it is said to hold almost
everywhere. The statement, therefore, that $P(A) = 0$ does not imply that A is impossible.
It does imply that the measure of the points corresponding to A is zero.
In the same way, a probability of 1 does not necessarily mean that the event in
question is absolutely sure, but only that the exceptional points in the appropriate
space have a measure zero.

Theorem 1.6  If $A \subset B$, then $P(A) \le P(B)$.

To prove this, let $C$ be the part of $B$ not included in $A$; then $B = C + A$.
By equation (4) above,

$P(B) = P(C) + P(A) \ge P(A)$,

since $P(C) \ge 0$ by (2).

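Theorem 1.7  For $n$ events $A_k$ ($k = 1, 2, \ldots, n$),

$P\left(\bigcup_{k=1}^{n} A_k\right) \le \sum_{k=1}^{n} P(A_k)$.

It is readily seen from a Venn diagram (see Figure 5) that

$\bigcup_{k=1}^{n} A_k = A_1 + \bar{A}_1 \cap A_2 + \bar{A}_1 \cap \bar{A}_2 \cap A_3 + \ldots + \bar{A}_1 \cap \bar{A}_2 \cap \ldots \cap \bar{A}_{n-1} \cap A_n$.

These disjoint sets (for the case $n = 4$) are shaded differently in the figure. But
$\bar{A}_1 \cap A_2 \subset A_2$, $\bar{A}_1 \cap \bar{A}_2 \cap A_3 \subset A_3$, etc.; therefore, by Theorem 1.6,
$P(\bar{A}_1 \cap A_2) \le P(A_2)$, etc. It follows that

$P\left(\bigcup_{k=1}^{n} A_k\right) \le P(A_1) + P(A_2) + \ldots + P(A_n) = \sum_{k=1}^{n} P(A_k)$.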

This theorem may be extended to include infinite sets of events. The equality
sign holds for disjoint events.

Theorem 1.8  The extension of Eq. (3) to three events gives

(1.7.7)  $P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$.

This is easily proved by writing $D$ for the event $B \cup C$ and applying Eq. (3).
Note that, by Theorem 1.4, $A \cap D = (A \cap B) \cup (A \cap C)$.
The result may be further extended to $n$ events as follows:

(1.7.8)  $P\left(\bigcup_{j=1}^{n} A_j\right) = \sum_{j} P(A_j) - \sum'_{jk} P(A_j \cap A_k) + \sum''_{jkl} P(A_j \cap A_k \cap A_l) - \ldots + (-1)^{n-1} P\left(\bigcap_{j=1}^{n} A_j\right)$,

where $\sum'_{jk}$ means that the sum is over all pairs $j$, $k$ with $j \ne k$, each pair being
counted once whatever its order, $\sum''_{jkl}$ means that the sum is over all triples $j$, $k$, $l$
with no two of these equal, again each counted once, and so on.
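
Equation (1.7.8) and the bound of Theorem 1.7 can be checked numerically on a small finite space. The following Python sketch is an illustration only: the universe of 12 equally weighted points and the three events are arbitrary assumptions, not taken from the text.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical universe of 12 equally weighted points (an assumption for illustration).
universe = set(range(12))
events = [{0, 1, 2, 3, 4}, {3, 4, 5, 6}, {1, 4, 6, 7, 8}]

def prob(subset):
    """Probability measure with equal weights on the points of the universe."""
    return Fraction(len(subset), len(universe))

def inclusion_exclusion(sets):
    """Right-hand side of (1.7.8): alternating sum over unordered index subsets."""
    total = Fraction(0)
    for size in range(1, len(sets) + 1):
        sign = (-1) ** (size - 1)
        for chosen in combinations(sets, size):
            total += sign * prob(set.intersection(*chosen))
    return total

lhs = prob(set.union(*events))               # P(A1 u A2 u A3)
rhs = inclusion_exclusion(events)            # right-hand side of (1.7.8)
boole_bound = sum(prob(e) for e in events)   # upper bound of Theorem 1.7
print(lhs, rhs, boole_bound)
```

Using `combinations` visits each unordered pair or triple of events exactly once, which is the sense in which the primed sums are to be read.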

1.8  Interpretation of Probability in Terms of Betting Odds  A recent book by
L.  J.  Savage  [2]  emphasizes  the  personal  aspect  of  probabilities.  Savage  argues 
that  the  rational  man  acts  as  if  there  exists  for  him,  corresponding  to  each 
situation  in  which  he  has  to  make  a  decision,  a  set  of  probabilities  and  a  set  of 
utilities.  The  probabilities  relate  to  the  various  states  or  aspects  of  the  world 
that  may  be  supposed  to  exist  (relevant  to  the  particular  situation).  The  utilities 
measure  in  some  way  the  values  that  will  accrue  to  him  from  each  particular 
decision  for  each  particular  unknown  state  of  the  world.  He  acts  in  such  a  way 
as  to  make  the  utility  he  expects  to  get  as  great  as  possible. 

The  probabilities,  being  a  personal  matter,  can  be  assessed  by  presenting  the 
man with a suitable bet. If there are just two relevant states of the world, $s_1$ and
$s_2$, with probabilities $p_1$ and $p_2$, and if the man is offered betting odds of $C$ to 1
against $s_1$, he will take the bet, provided $p_2/p_1 < C$. By varying $C$ until the bet is
accepted, the values of $p_1$ and $p_2$ can be determined (remembering that
$p_1 + p_2 = 1$). When the English physicist, Sir John Cockcroft, said recently,
speaking about the British atomic reactor, Zeta, "I am 90% certain that the
neutrons were produced by a thermonuclear reaction," he meant presumably
that he would be willing to bet as high as 9 to 1 that the reaction was in fact
thermonuclear. When a man bets on a horse which is quoted at odds of 6 to 1
against, it may be concluded that he regards the probability of its winning as at
least $\tfrac{1}{7}$. (If $p_1 = \tfrac{1}{7}$, $p_2 = \tfrac{6}{7}$, then $p_2/p_1 = 6$.) At least the "rational man" would so act.
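
The relation between quoted odds and the implied probability can be sketched in a few lines of Python. The function names below are hypothetical, introduced only for illustration; the arithmetic is simply the boundary case $p_2/p_1 = C$ with $p_1 + p_2 = 1$.

```python
from fractions import Fraction

def min_probability_from_odds_against(c):
    """Smallest p1 at which a bet at odds of c to 1 against s1 is acceptable.

    The bet is taken when p2/p1 < c with p1 + p2 = 1; the boundary case
    p2/p1 = c gives p1 = 1/(1 + c).
    """
    return Fraction(1, 1 + c)

def odds_against(p1):
    """Odds p2/p1 against a state whose probability is p1."""
    p1 = Fraction(p1)
    return (1 - p1) / p1

# Horse quoted at 6 to 1 against: implied probability of winning.
print(min_probability_from_odds_against(6))
# A state held to have probability 1/10 (the contrary of a 90% certainty).
print(odds_against(Fraction(1, 10)))
```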

1.9  Interpretation of Probability in Terms of Relative Frequency  If an
event $A$ is assigned the probability $p$, in a particular context, one interpretation
of this probability is that in a long series of $n$ similar trials in this context, the
event would happen in approximately $np$ trials and fail to happen in the
remainder. As already mentioned, this is a special case of the law of large numbers.
Where there is no other natural or reasonable method of assigning probabilities,
the experimental method of counting the number of times ($r$) that the event
happens, and using the approximation $r/n$ for $p$, is usually applicable. Insurance
companies  estimate  the  chance  that  an  insured  person  will  survive  to  a  given 
age  by  a  study  of  the  records  of  such  persons  in  the  past.  Actually,  the  context 
has  not,  in  this  particular  example,  remained  quite  constant.  Improvements  in 
public  health,  and  new  drugs,  have  in  recent  years  greatly  increased  the  chances 
of  survival  of  individuals,  so  that  the  probabilities  assessed  from  the  records  of, 
say,  the  past  50  years  tend  to  be  too  low.  However,  there  is  no  way,  except  from 
the  records,  to  assess  these  probabilities.  The  companies  should,  of  course, 
use  the  most  recent  records  that  are  available. 

If, as the outcome of every trial, we can say that a particular event $A$ has or has not
happened, and also at the same time that another event $B$ has or has not happened,
there are clearly four possibilities altogether as regards the two events jointly, namely,
the events denoted by $A \cap B$, $A \cap \bar{B}$, $\bar{A} \cap B$, and $\bar{A} \cap \bar{B}$. These are mutually exclusive,
or disjoint.

If the corresponding frequencies for these compound events are $a$, $b$, $c$, and
$d$ (the sum of all these being $n$), the relative frequencies are $a/n$, $b/n$, etc. The
relative frequency of $A$ is $(a + b)/n = r_1/n$, where $r_1$ is the total frequency in the
first row of the two-by-two table (Figure 6), since both events $A \cap B$ and $A \cap \bar{B}$
imply that $A$ happens. If we denote the relative frequency of $A$ by $f(A)$, then

(1.9.1)  $f(A) = f(A \cap B) + f(A \cap \bar{B})$.

Fig. 6  Two-by-two frequency table


The event $A \cup B$ includes all cases in which either $A$ or $B$ (or both)
happen, that is, it includes $A \cap B$, $A \cap \bar{B}$ and $\bar{A} \cap B$, but not $\bar{A} \cap \bar{B}$. The total
frequency for $A \cup B$ is therefore $a + b + c = r_1 + c_1 - a$, where $r_1$ is the total
frequency in the first row and $c_1$ is the total frequency in the first column. Since
$f(B) = c_1/n$, we have the rule

(1.9.2)  $f(A \cup B) = f(A) + f(B) - f(A \cap B)$.


If  we  regard  the  relative  frequencies  (for  large  n)  as  approximations  to  the 
corresponding  probabilities,  we  arrive  at  the  basic  law  for  probabilities  given 
in  (1.7.3). 
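
The two-by-two bookkeeping can be tried out on simulated trials. The Python sketch below is only an illustration: the events (first die even; total of two dice at least 8), the seed, and the number of trials are all assumptions chosen for the example.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
n = 10_000
a = b = c = d = 0  # frequencies of (A and B), (A, not B), (not A, B), (neither)

for _ in range(n):
    first, second = random.randint(1, 6), random.randint(1, 6)
    in_A = first % 2 == 0          # hypothetical event A: first die even
    in_B = first + second >= 8     # hypothetical event B: total at least 8
    if in_A and in_B:
        a += 1
    elif in_A:
        b += 1
    elif in_B:
        c += 1
    else:
        d += 1

r1, c1 = a + b, a + c              # first-row and first-column totals
f_A, f_B, f_AB = r1 / n, c1 / n, a / n
f_union = (a + b + c) / n          # relative frequency of A or B (or both)
print(f_A, f_B, f_AB, f_union)
```

Whatever the simulated counts turn out to be, `f_union` agrees with `f_A + f_B - f_AB`, because the identity is a matter of counting and does not depend on $n$ being large.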


1.10  Conditional Probability  In the table of Figure 6, the event $A$ happens
in $r_1$ cases altogether, and in $a$ of these cases the event $B$ also happens. We can
therefore state that the relative frequency of $B$, when it is known that $A$ happens,
is given by

(1.10.1)  $f(B \mid A) = a/r_1 = (a/n) \div (r_1/n) = f(B \cap A)/f(A)$,

where it is assumed, of course, that $r_1$ is not zero.

This  rule  also  can  be  assumed  to  apply  to  probabilities,  and  we  can  in  fact 
define  the  conditional  probability  of  B,  given  that  A  happens,  by 

(1.10.2)  $P(B \mid A) = \dfrac{P(B \cap A)}{P(A)}$,  $P(A) \ne 0$.
In the same way,

(1.10.3)  $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$,  $P(B) \ne 0$.

The two events $A$ and $B$ are said to be independent if

(1.10.4)  $P(A \cap B) = P(A) \cdot P(B)$.

If this is so, then from (2) and (3) $P(B \mid A) = P(B)$ and $P(A \mid B) = P(A)$. This
means that the probability of $A$ or of $B$ does not depend at all upon whether
the other happens, in agreement with the intuitive idea of independence.
Equation (4) is often called the multiplication law for probabilities.

With more than two events the situation becomes rather complicated. Three
events $A$, $B$, $C$ are independent if each pair ($AB$, $AC$ and $BC$) is independent and
if also

(1.10.5)  $P(A \cap B \cap C) = P(A) \cdot P(B) \cdot P(C)$.

This implies that four probability conditions have to hold. These may be
$P(A \mid B) = P(A)$, $P(A \mid C) = P(A)$, $P(B \mid C) = P(B)$ and $P(C \mid A \cap B) = P(C)$.
There are also five other conditions (given by interchanging the letters $A$, $B$,
$C$) which hold when the first four hold.

The general relationship which replaces (5) when $A$, $B$, $C$ are not independent
may be written

(1.10.6)  $P(A \cap B \cap C) = P(A) \cdot P(B \mid A) \cdot P(C \mid A \cap B)$,  $P(A) \ne 0$, $P(A \cap B) \ne 0$.

The  fact  that  three  events  may  be  pairwise  independent  without  being 
completely  independent  may  be  illustrated  by  the  following  example  [8].  Imagine 
four  similar  discs  in  a  bowl,  numbered  respectively  112,  121,  211,  222,  and 
suppose  one  disc  is  picked  at  random.  Let  the  events  A,  B,  C  be  "the  first  digit 
on  the  disc  picked  is  1,"  "the  second  digit  on  this  disc  is  1,"  and  "the  third  digit 
on this disc is 1." Then it is easy to see that $P(A) = P(B) = P(C) = \tfrac{1}{2}$,
$P(A \cap B) = P(A \cap C) = P(B \cap C) = \tfrac{1}{4}$, but $P(A \cap B \cap C) = 0$. The three pairs AB,
AC,  and  BC  are  therefore  independent  but  the  condition  (5)  is  not  satisfied,  so 
that  A,  B,  C  are  not  all  three  independent. 
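
The disc example is small enough to verify by direct enumeration. The Python sketch below assumes only what the example states (four equally likely discs); the helper names are hypothetical.

```python
from fractions import Fraction

discs = ["112", "121", "211", "222"]  # each disc equally likely

def prob(event):
    """Probability of an event given as a predicate on a disc label."""
    return Fraction(sum(1 for disc in discs if event(disc)), len(discs))

A = lambda disc: disc[0] == "1"   # first digit is 1
B = lambda disc: disc[1] == "1"   # second digit is 1
C = lambda disc: disc[2] == "1"   # third digit is 1

def both(e1, e2):
    """Intersection of two events."""
    return lambda disc: e1(disc) and e2(disc)

# Pairwise independence: P(E1 and E2) = P(E1) P(E2) for each pair.
pairwise = all(
    prob(both(e1, e2)) == prob(e1) * prob(e2)
    for e1, e2 in [(A, B), (A, C), (B, C)]
)
# Condition (1.10.5) compares these two quantities.
triple = prob(lambda disc: A(disc) and B(disc) and C(disc))
product_of_three = prob(A) * prob(B) * prob(C)
print(pairwise, triple, product_of_three)
```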
1.11  Elements of Combinatorial Analysis  Most ordinary calculations in
elementary probability are based on the assumption that every simple element
of the finite set $\mathcal{U}$ of all possible outcomes has the same weight. The probability
measure of an event $A$ is therefore proportional to the number of elements of $\mathcal{U}$
included in $A$. Thus to find the probability of the event, "a total of 11 spots with
two dice," we assume that all of the 36 ordered pairs of numbers which can
represent the fall of the two dice have an equal weight. Two of these, namely
(6, 5) and (5, 6), correspond to a sum of 11, and the probability of this event, on
the basis of our assumption, is therefore 2/36. In more complicated situations
the calculation of the number of elements included in a particular subset
of $\mathcal{U}$ will often involve the mathematics of permutations and combinations.
We  therefore  recall  briefly  a  few  definitions  and  theorems,  without  giving  proofs 
of  the  latter. 

Theorem 1.9  The number of ordered arrangements (permutations) of $n$
distinguishable objects is $1 \cdot 2 \cdot 3 \cdots n$. This number is denoted by $n!$ (read
"factorial $n$").

Theorem 1.10  The number of ways of selecting and arranging in order $r$ out
of $n$ distinguishable objects is $n!/(n - r)!$. This is often denoted by $(n)_r$, a notation
due to Feller [9]. When $r = n$, the result should reduce to $n!$, so that we must
agree to define $0!$ as 1.

Theorem 1.11  The number of ways of arranging in order $n_1$ objects all alike
of one kind, $n_2$ all alike of a second kind, and so on, up to $k$ kinds of objects, is

$n!/(n_1!\, n_2! \cdots n_k!)$, where $\sum_i n_i = n$.

Theorem 1.12  The number of ways of picking $r$ out of $n$ distinguishable
objects, regardless of the order in which they are arranged, is called the number of
combinations of $n$ objects taken $r$ at a time, and is

$\binom{n}{r} = \dfrac{n!}{r!\,(n - r)!} = (n)_r / r!$.

The symbol $\binom{n}{r}$ may be read "$n$ above $r$." Other notations such as $C(n, r)$ or ${}^nC_r$
are also met with, but the one used here seems to be increasingly common. From the
definition it follows immediately that $\binom{n}{0} = \binom{n}{n} = 1$. The symbol $\binom{n}{r}$ for $r > n$
is defined as 0.

Since $\binom{n}{r} = n(n - 1)(n - 2) \cdots (n - r + 1)/r!$, the notation can be
extended by writing

$\binom{-n}{r} = (-n)(-n - 1)(-n - 2) \cdots (-n - r + 1)/r!$
$= (-1)^r\, n(n + 1)(n + 2) \cdots (n + r - 1)/r!$
$= (-1)^r \binom{n + r - 1}{r}$.



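Theorem 1.13  The binomial theorem for a positive integral index may be
written

$(1 + x)^n = 1 + nx + \dfrac{n(n - 1)}{2!} x^2 + \ldots + x^n = \sum_{r=0}^{n} \binom{n}{r} x^r = \sum_{r=0}^{\infty} \binom{n}{r} x^r$,

since all the coefficients $\binom{n}{r}$ vanish when $r > n$.

Theorem 1.14  The binomial theorem for a negative integral index may be
written

$(1 + x)^{-n} = \sum_{r=0}^{\infty} \binom{-n}{r} x^r = \sum_{r=0}^{\infty} (-1)^r \binom{n + r - 1}{r} x^r$,  $|x| < 1$.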

The  two  theorems  1.13  and  1.14  are  therefore  formally  identical  with  the 
substitution  of  —n  for  n.  The  first  expansion  however  corresponds  to  a  finite 
number  of  terms  and  the  second  to  an  infinite  series. 
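
The counting formulas and the extension to a negative upper index can be checked with Python's standard library. This is a sketch under stated assumptions: `gen_binom` is a hypothetical helper written for this illustration, and the values $n = 7$, $r = 3$, $x = 0.1$ and the 60-term truncation are arbitrary choices.

```python
from fractions import Fraction
from math import comb, factorial, perm

def gen_binom(a, r):
    """Generalized binomial coefficient a(a-1)...(a-r+1)/r! for any integer a."""
    numerator = 1
    for i in range(r):
        numerator *= a - i
    return Fraction(numerator, factorial(r))

n, r = 7, 3
falling = perm(n, r)      # (n)_r = n!/(n-r)!, Theorem 1.10
combos = comb(n, r)       # n!/(r!(n-r)!), Theorem 1.12
neg = gen_binom(-n, r)    # should equal (-1)^r C(n + r - 1, r)

# Theorem 1.14 checked numerically at the hypothetical point x = 0.1.
x = 0.1
series = sum(float(gen_binom(-n, k)) * x**k for k in range(60))
closed_form = (1 + x) ** (-n)
print(falling, combos, neg, series, closed_form)
```

Note that `math.comb(n, r)` returns 0 when `r > n`, in agreement with the convention adopted above.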

1.12  Sampling  from  a  Finite  Population  The  process  of  picking  a  set  of  r 
objects  out  of  a  given  set  of  n  objects  is  often  called  sampling.  The  n  objects 
constitute  a  "population,"  or  "universe,"  and  the  r  objects  constitute  the  sample. 
If  every  object  in  the  population  has  an  equal  chance  to  be  picked  for  a  particular 
sample,  this  sample  is  said  to  be  random.  Other  kinds  of  sampling  are  of  course 
possible.  For  instance,  we  could  arrange  the  population  in  some  order  and  pick 
every  tenth  object.  This  would  be  a  systematic  sample.  In  most  problems  of 
statistical  inference,  where  it  is  required  to  infer  properties  of  a  population  from 
those  of  a  sample,  it  is  understood  that  the  sampling  is  random.  Sometimes,  for 
convenience  or  even  for  increased  accuracy,  some  special  scheme  of  sampling 
may  be  adopted,  but  if  we  are  to  make  valid  inferences  from  the  sample  to  the 
population  there  must  be  an  element  of  randomness  about  the  choice  of  the 
sample.  Sampling  procedures  are  discussed  more  fully  in  Chapter  7. 

There  are  two  ways  in  which  we  can  choose  our  random  sample.  We  may 
pick  an  object,  make  a  note  of  it,  and  put  it  back  before  picking  the  next  object. 
This  is  called  "sampling  with  replacements,"  and  of  course  implies  that  the  same 
object can appear more than once in the same sample. In fact, since there are $n$
possible choices for each item in the sample, the number of ways of picking the
sample (taking the order into account) is $n^r$. On the other hand, when an
object is picked for the sample it may be put on one side and removed from the
population. It is not then available to be picked again in the same sample. This
is the situation envisaged in Theorems 1.10 and 1.12, and is called "sampling
without replacements." The number of ways of picking the sample (taking order
into account) is now $(n)_r = n(n - 1) \cdots (n - r + 1)$.

Example 1  The number of possible five-digit numbers (including those
beginning with one or more zeros) is $10^5$. The number with all five digits different
is $(10)_5$ = 30,240. The probability that a five-digit number selected at random
will have all its digits different is therefore 0.3024.
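
The two counts in Example 1 can be confirmed by brute force. The Python sketch below simply enumerates every ordered string of five digits; it is an illustration of sampling with and without replacement, not an efficient method.

```python
from itertools import product
from math import perm

digits = "0123456789"
r = 5

# Ordered samples with replacement: every string of r digits.
with_replacement = len(digits) ** r

# Ordered samples without replacement, counted by brute force.
all_different = sum(1 for s in product(digits, repeat=r) if len(set(s)) == r)

ratio = all_different / with_replacement
print(with_replacement, all_different, perm(len(digits), r), ratio)
```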

Example 2  The probability that in a class of 25 students no two will have
the same birthday is, by a similar argument, $(365)_{25}/365^{25}$. On the assumption
that all days in the year are equally likely as birthdays (and ignoring leap years),
the number of possible arrangements of birthdays among the 25 people is $365^{25}$
and the number of arrangements with all birthdays different is $(365)_{25}$. The
probability may be written as

$p = \left(1 - \dfrac{1}{365}\right)\left(1 - \dfrac{2}{365}\right) \cdots \left(1 - \dfrac{24}{365}\right)$.

As a rough approximation, writing $\log_e(1 - x) \approx -x$, we have

$\log_e p \approx -\dfrac{1 + 2 + \ldots + 24}{365} = -\dfrac{300}{365} = -0.822$,

giving $p \approx 0.44$. The exact result, which may be evaluated by means of a table of
logarithms of factorials (e.g., Glover's Tables [10] or Biometrika Tables [11]),
is 0.4313. It is rather surprising to most people that the chance of at least two
coincident birthdays in a group of this size should be as high as it is, namely
0.5687 ($= 1 - p$).
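
The birthday calculation no longer needs a table of log factorials; a direct product suffices. The following Python sketch assumes, as the example does, 365 equally likely birthdays; the function name is hypothetical.

```python
from math import exp

def prob_all_different(people, days=365):
    """Probability that no two of `people` share a birthday (equally likely days)."""
    p = 1.0
    for k in range(people):
        p *= 1 - k / days
    return p

exact = prob_all_different(25)
# Rough approximation log(1 - x) ~ -x gives log p ~ -(1 + 2 + ... + 24)/365.
log_rough = -sum(range(1, 25)) / 365
rough = exp(log_rough)
print(exact, log_rough, rough, 1 - exact)
```

The rough value overstates $p$ slightly because the neglected terms of the logarithmic series are all negative.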

Example  3  The  probability  of  holding  precisely  three  aces  in  a  hand  at 
bridge  is  the  ratio  of  the  number  of  possible  hands  containing  three  aces  to  the 
number  of  hands  altogether.  The  basic  assumption  is  that  every  completely 
specified hand of 13 cards that can be dealt from a deck of 52 cards is just as
likely as every other, which is probably reasonable if the deck is more thoroughly
shuffled than is customary in actual play.

The total number of possible hands is $\binom{52}{13}$, which is a very large number,
about 635 billion. The number with three aces is given by multiplying the
number of ways of picking the three aces, namely $\binom{4}{3}$, by the number of ways
of picking the 10 other cards in the hand out of the 48 cards in the deck which
are not aces, namely $\binom{48}{10}$. The required probability is therefore $\binom{4}{3}\binom{48}{10} / \binom{52}{13}$,
which reduces to 0.0412.
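
The counts in Example 3 can be reproduced with `math.comb`. The sketch below also tabulates, as an extra not in the text, the probabilities of holding 0 to 4 aces, which must add up to 1.

```python
from math import comb

hands = comb(52, 13)                      # all 13-card hands
three_aces = comb(4, 3) * comb(48, 10)    # choose the aces, then the other ten cards
p_three_aces = three_aces / hands

# The same counting argument gives the whole distribution of the number of aces.
distribution = [comb(4, k) * comb(48, 13 - k) / hands for k in range(5)]
print(hands, p_three_aces, distribution)
```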
Example 4  Why does it pay, in the long run, to bet even money on seeing
6 at least once in four rolls of a die, but not to bet even money on seeing double-6
at least once in 24 rolls with two dice?

This  problem  was  solved  early  in  the  history  of  probability  theory.  It  was 
posed by the Chevalier de Méré, a courtier and amateur mathematician at the
French  court  around  1650,  to  the  celebrated  mathematician  Blaise  Pascal.  Since 
the  chance  of  6  in  a  single  throw  with  one  die  is  six  times  as  great  as  the  chance  of 
double-6  with  two  dice,  it  seemed  only  natural  to  him  that  the  number  of  throws 
necessary  for  an  even  chance  of  seeing  the  second  event  should  be  just  six  times 
the  number  necessary  for  the  same  chance  of  seeing  the  first  event.  However, 
calculations  did  not  seem  to  bear  this  out,  and  Pascal  was  able  to  show  him  that 
his  supposition  was  in  fact  not  true. 

It is easier to find the probability of the complementary event, no 6 at all,
than that of at least one 6. The probability of not seeing 6 in a single roll is 5/6,
and if the rolls are independent the probability of not-6 on four successive rolls
is $(5/6)^4$. The probability of at least one 6 is therefore $1 - (5/6)^4 = 0.518$, which
is a better than even chance.

Similarly, the chance of not seeing double-6 in all 24 successive rolls with two
dice is $(35/36)^{24}$ and the chance of at least one double-6 is therefore
$1 - (35/36)^{24} = 0.491$. This is a less than even chance. The two chances are so nearly equal,
however,  that  it  seems  unlikely  that  the  difference  could  be  detected  empirically 
in  ordinary  play. 
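
The two chances in de Méré's problem are quickly evaluated by machine. The Python sketch below also computes, as an addition not in the text, the chance with 25 rolls, to show how close 24 rolls is to the break-even point.

```python
# Complement rule: P(at least one success) = 1 - P(no success in any roll).
p_one_six = 1 - (5 / 6) ** 4             # at least one 6 in four rolls of one die
p_double_six = 1 - (35 / 36) ** 24       # at least one double-6 in 24 rolls of two dice
p_double_six_25 = 1 - (35 / 36) ** 25    # the same with one additional roll
print(p_one_six, p_double_six, p_double_six_25)
```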

Example 5  A card is drawn from an ordinary deck, looked at, and replaced,
and the deck is shuffled. How many times should this be done in order to have a
90% chance of seeing the ace of spades at least once?

The same argument as in Example 4 leads to the conclusion that the chance of
not seeing the ace of spades in $n$ successive draws is $(51/52)^n$. If this is put equal
to 0.1, we have an equation for $n$.

Inverting both sides and taking common logs, we obtain

$n(\log 52 - \log 51) = \log 10 = 1$,

so that $n = 1/0.00843 \approx 118.6$, and the required number of draws is 119.

In problems such as this, $n$ must necessarily be an integer, so that the
probability cannot always be adjusted exactly to a pre-assigned value. By taking
the next higher integral value of $n$, we ensure that the probability will be at least
equal to the value given.
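
The rounding-up step can be made explicit in code. The Python sketch below uses a hypothetical helper `draws_needed`; it solves $(1 - p)^n \le 1 - \text{target}$ for the smallest integer $n$ and then checks the integers on either side.

```python
from math import ceil, log

def draws_needed(target, p_single):
    """Smallest n with 1 - (1 - p_single)**n >= target (draws with replacement)."""
    return ceil(log(1 - target) / log(1 - p_single))

n = draws_needed(0.90, 1 / 52)
chance_with_n = 1 - (51 / 52) ** n
chance_with_one_fewer = 1 - (51 / 52) ** (n - 1)
print(n, chance_with_n, chance_with_one_fewer)
```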

1.13  The  Indicator  Function  The  idea  of  a  function  plays  an  important  role 
in  probability  theory.  A  function  is  a  rule  which  takes  us  from  one  set  (called  the 
domain  of  the  function)  to  another  set  (called  the  range).  To  each  element  of 
the domain the function assigns one element of the range. Thus the function $x^2$
may have as domain the whole set of real numbers, positive and negative and
zero, and if so its range is the set of positive real numbers and zero. To every
value of $x$ there is just one value of $x^2$. The function is said to be defined on the
domain. 

In probability theory, a function may be defined on the set of points such as
$a_i$ which belong to the universal set $\mathcal{U}$ of all possibilities (within the given
context). If its range consists of the set $R = (r_1, r_2, \ldots, r_k)$ and if the function
assigns to each $a_i$ the value $r_j$, we may write the functional relation as $f(a_i) = r_j$.

It often happens that the function assigns to several elements of $\mathcal{U}$ the same
element $r_j$. If so, we can define a probability measure on the space $R$ by giving to
each element $r_j$ the sum of the weights of all the points $a_i$ which are such that
$f(a_i) = r_j$. If $A$ denotes the set of all these points $a_i$, then $P(A)$ is the weight
attached to $r_j$, where $P$ is the original probability measure on $\mathcal{U}$. The measure
defined in this way on $R$ is called the measure induced by the function $f$.

A particularly simple and useful example of a function is the indicator function
$I_A$, defined on the whole space $\mathcal{U}$. This has just two values in its range, 1 when
the point $a_i$ belongs to $A$ (a subset of $\mathcal{U}$) and 0 when it belongs to $\bar{A}$. The indicator
function corresponding to an event $A$ may be thought of as 1 whenever $A$ happens
and 0 when it does not happen.

The indicator function of the event $A \cap B$ is given by

(1.13.1)  $I_{A \cap B} = I_A I_B$,

since $I_A$ and $I_B$ are both nonzero only for points lying in the intersection of $A$
and $B$. Similarly for the union of $A$ and $B$,

(1.13.2)  $I_{A \cup B} = I_A + I_B - I_{A \cap B}$,

as is easily verified by checking the values of the right hand side for the different
regions making up $A \cup B$ in a Venn diagram.
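
The point-by-point check suggested above can be carried out mechanically. In the Python sketch below the 12-point space and the events `A` and `B` are arbitrary assumptions; the helper `indicator` is hypothetical.

```python
universe = range(12)          # hypothetical finite space of 12 points
A = {0, 1, 2, 3, 4}
B = {3, 4, 5, 6}

def indicator(event):
    """Return the indicator function of `event` as a function of a point."""
    return lambda point: 1 if point in event else 0

I_A, I_B = indicator(A), indicator(B)
I_AB, I_AuB = indicator(A & B), indicator(A | B)

# (1.13.1) and (1.13.2) verified at every point of the space.
product_rule = all(I_AB(a) == I_A(a) * I_B(a) for a in universe)
union_rule = all(I_AuB(a) == I_A(a) + I_B(a) - I_AB(a) for a in universe)
print(product_rule, union_rule)
```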

If the whole space $\mathcal{U}$ is partitioned into a set of disjoint events $A_1, A_2, \ldots, A_n$,
and if a function $X$ is defined on all points of $\mathcal{U}$ by the relation $X(A_j) = x_j$,
where $x_1, x_2, \ldots, x_n$ are real numbers, then $X$ is called a simple random variable,
or a variate. From the definition of the indicator function, it follows that

(1.13.3)  $X = \sum_{j=1}^{n} x_j I_{A_j}$.

Thus for two dice the space of possibilities consists of 36 points (Figure 1).
If the variate $X$ is the sum of spots shown by the two dice, $x_j$ may take any one
of the eleven values 2, 3, 4, ..., 12. The set $A_1$ consists of the single point (1, 1).
The set $A_2$ consists of two points (1, 2) and (2, 1), and so on. A variate may thus
be thought of as a mapping from the space of possibilities $\mathcal{U}$ (the domain) into
the axis of real numbers (the range). All the elements of $A_1$ are mapped into the
point $x_1$, all the elements of $A_2$ into the point $x_2$, and so on. See Figure 7. We
shall for the most part adhere to the convention of representing a variate by a
capital letter such as $X$ and the numerical values in its range by small letters such
as $x$. We can then, for instance, speak of the probability that $X$ takes the
value $x$, or the set of values between $x_1$ and $x_2$.

This distinction between a variate $X$ and a numerical value $x$ (which the
variate may take) is one which the student should try to get clear in his mind.
The variate is a function on the space of events to the real axis. It associates with
each possible event $A_j$ a real number $x_j$, which may be the obvious number
connected with the event (as in the example of the sum of spots shown by two
dice) or a more or less arbitrary number (as when we denote a male birth by 1
and a female birth by 0). The domain of $X$ is the set of all possible events $A_j$ in the
particular context considered, and the range of $X$ is the corresponding set of real
numbers. The range, of course, may include only a discrete set of numbers or
may include all real numbers in a finite interval or even the whole of the real
axis. The domain is often called the "sample space" or the "possibility space."
A point in this space (or a set of points) is a possible event of the type considered.

Fig. 7  Variate $X$ as a mapping into the real axis


1.14  Expectation  If $P(A_j)$ is the probability of the event $A_j$, the expectation
of the variate $X$ is

(1.14.1)  $E(X) = \sum_j x_j P(A_j)$.


Example  6  If  10,000  tickets  are  sold  in  a  lottery  and  if  there  is  one  prize  of 
$1000  and  ten  prizes  of  $50,  what  is  the  expectation  of  the  worth  of  a  single 
ticket? 

If  X  is  the  worth  of  a  ticket  in  dollars  it  is  a  variate  which  takes  three  values, 
namely,  1000,  50  and  0.  The  probabilities  corresponding  to  these,  on  the 
assumption  that  the  winning  tickets  are  picked  by  a  purely  random  process,  are 
1/10,000, 1/1000 and 9989/10,000 respectively. The expectation of X is therefore
$(1000/10,000 + 50/1000), or 15 cents. If the price of the ticket were 15 cents
the lottery would be "fair," in the sense that the price would be equal to the
expectation.  Actual  lotteries  and  gambling  games  are  not  fair  in  this  sense,  since 
a  substantial  percentage  of  the  money  raised  goes  to  the  organizers  (or  the 
"bank")  and  is  not  available  for  prizes. 

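The lottery expectation of Example 6 is a direct application of (1.14.1). The Python sketch below uses exact fractions; the dictionary layout is merely one convenient assumption about how to hold the prize table.

```python
from fractions import Fraction

tickets = 10_000
prizes = {1000: 1, 50: 10}                     # prize value in dollars -> number of prizes
prizes[0] = tickets - sum(prizes.values())     # remaining tickets win nothing

# E(X) = sum over values x_j of x_j * P(A_j), with P(A_j) = count / tickets.
expectation = sum(Fraction(value * count, tickets) for value, count in prizes.items())
print(prizes[0], expectation, float(expectation))
```
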
Example  7  Johnny  is  collecting  a  set  of  12  kinds  of  coupon,  one  coupon 
being  found  in  each  packet  of  a  particular  breakfast  cereal.  If  the  family  buys 
a  new  packet  on  Monday  of  each  week,  how  long  should  he  expect  to  have  to 
wait  before  the  set  is  complete  ?  It  is  assumed  that  the  different  kinds  of  coupon 
are  distributed  at  random  in  packets  of  the  cereal. 

Suppose  that  on  a  particular  Monday,  after  opening  the  new  packet,  Johnny 
has collected $x$ different kinds of coupon ($1 \le x \le 11$). The chance that the
packet to be opened in one week's time contains one of the kinds he already has
is $x/12$, and the chance that it has a new kind is $1 - x/12$. The chance that he
has to wait two weeks before getting a new kind is $(x/12)(1 - x/12)$. The chance
that he has to wait $r$ weeks and then gets a new kind is $(x/12)^{r-1}(1 - x/12)$.
Denoting this probability by $p(r, x)$, we obtain as the expectation of $r$ for a given
$x$ the expression

$\sum_{r=1}^{\infty} r\, p(r, x) = \left(1 - \dfrac{x}{12}\right) \sum_{r=1}^{\infty} r \left(\dfrac{x}{12}\right)^{r-1} = \dfrac{12}{12 - x}$.

The total expected time before the set is complete is therefore

$\sum_{x=1}^{11} \dfrac{12}{12 - x} = 12\left(\dfrac{1}{11} + \dfrac{1}{10} + \ldots + 1\right) = 36.2$ weeks.
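
The expected waiting time in Example 7 can be computed exactly and compared with a simulation. In the Python sketch below, the seed and the number of simulated collections are arbitrary assumptions, and `simulate` is a hypothetical helper; time is counted, as in the example, from the Monday on which the first coupon is obtained.

```python
from fractions import Fraction
import random

kinds = 12
# Expected wait for a new kind when x kinds are already held is 12/(12 - x).
expected_weeks = sum(Fraction(kinds, kinds - x) for x in range(1, kinds))

def simulate(rng):
    """Weeks needed after the first packet until all kinds have been seen."""
    seen = {rng.randrange(kinds)}     # the first packet supplies one kind
    weeks = 0
    while len(seen) < kinds:
        seen.add(rng.randrange(kinds))
        weeks += 1
    return weeks

rng = random.Random(0)                # fixed seed for a repeatable run
trials = 20_000
average = sum(simulate(rng) for _ in range(trials)) / trials
print(float(expected_weeks), average)
```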

Theorem 1.15  If $I_A$ is the indicator function for the set $A$, $E(I_A) = P(A)$.
By the definition of $I_A$ it takes only the two values, 1 for $A$ and 0 for $\bar{A}$. Its
expectation is therefore $1 \cdot P(A) + 0 \cdot P(\bar{A}) = P(A)$.

Theorem 1.16  If $X$ and $Y$ are variates defined over the same space $\mathcal{U}$ of
possibilities, $E(X + Y) = E(X) + E(Y)$.

If the space $\mathcal{U}$ is subdivided into disjoint sets $A_j$ ($j = 1, 2, \ldots, n$) and also into
disjoint sets $B_k$ ($k = 1, 2, \ldots, m$), then, by Eq. (1),

$E(X) = \sum_j x_j P(A_j)$,  $E(Y) = \sum_k y_k P(B_k)$.

Now the event $A_j$ may be separated into disjoint sets $A_j \cap B_1$, $A_j \cap B_2$, ...,
$A_j \cap B_m$ (see Figure 8), so that $A_j = \sum_k A_j \cap B_k$. Similarly, $B_k = \sum_j A_j \cap B_k$.
Therefore,

$E(X) + E(Y) = \sum_j x_j P\left(\sum_k A_j \cap B_k\right) + \sum_k y_k P\left(\sum_j A_j \cap B_k\right)$.

Since the sets in each sum are disjoint, the addition law (1.7.4) gives

$P\left(\sum_k A_j \cap B_k\right) = \sum_k P(A_j \cap B_k)$  and  $P\left(\sum_j A_j \cap B_k\right) = \sum_j P(A_j \cap B_k)$.

Hence,

$E(X) + E(Y) = \sum_j \sum_k (x_j + y_k) P(A_j \cap B_k)$.

By definition, $E(X + Y) = \sum_j \sum_k (x_j + y_k) P(A_j \cap B_k)$, since for all points in the
intersection of $A_j$ and $B_k$ the variable $X + Y$ has the value $x_j + y_k$.
The result can be extended to any finite number of variates.
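
Theorem 1.16 requires no independence, which a small enumeration makes concrete. In the Python sketch below the two variates (the sum of spots and the larger of the two spots on a pair of dice) are assumptions chosen because they are plainly dependent.

```python
from fractions import Fraction
from itertools import product

points = list(product(range(1, 7), repeat=2))   # 36 equally weighted outcomes
weight = Fraction(1, len(points))

def expectation(variate):
    """E(X) as the weighted sum of the variate over all points of the space."""
    return sum(variate(point) * weight for point in points)

X = lambda pt: pt[0] + pt[1]        # sum of spots
Y = lambda pt: max(pt)              # larger of the two spots (not independent of X)

lhs = expectation(lambda pt: X(pt) + Y(pt))
rhs = expectation(X) + expectation(Y)
print(lhs, rhs)
```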

1.15  Independent Variates  We have defined the concept of independence
for events, in § 1.10. Classes of events $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, ... are independent if $A_j$, $B_k$, $C_l$, ...
are independent, where $A_j$ is any member of $\mathcal{A}$, $B_k$ any member of $\mathcal{B}$, and so on.
Random variables $X$, $Y$, $Z$, ... are independent if the partitions of $\mathcal{U}$ on which they
are defined are independent, that is, if $X = \sum_j x_j I_{A_j}$, $Y = \sum_k y_k I_{B_k}$, etc., where each
$A_j$ is independent of each $B_k$, etc.


Fig. 8  Disjoint subsets of a set

Theorem 1.17  If $X$ and $Y$ are independent,

$E(XY) = E(X) \cdot E(Y)$.

By definition, $E(XY) = \sum_j \sum_k x_j y_k P(A_j \cap B_k)$. Since $A_j$ and $B_k$ are independent,
$P(A_j \cap B_k) = P(A_j) P(B_k)$. Therefore,

$E(XY) = \sum_j x_j P(A_j) \sum_k y_k P(B_k) = E(X) \cdot E(Y)$.

This result also can be extended to any finite number of variates.
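
The role of independence in Theorem 1.17 can be seen by comparing an independent pair of variates with a dependent pair. The Python sketch below again uses two dice; the particular variates are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

points = list(product(range(1, 7), repeat=2))   # 36 equally weighted outcomes
weight = Fraction(1, len(points))

def expectation(variate):
    """E(X) as the weighted sum of the variate over all points of the space."""
    return sum(variate(point) * weight for point in points)

first = lambda pt: pt[0]             # spots on the first die
second = lambda pt: pt[1]            # spots on the second die (independent of the first)
total = lambda pt: pt[0] + pt[1]     # sum of spots (depends on the first die)

independent_case = expectation(lambda pt: first(pt) * second(pt))
dependent_case = expectation(lambda pt: first(pt) * total(pt))
print(independent_case, expectation(first) * expectation(second))
print(dependent_case, expectation(first) * expectation(total))
```

The first pair of printed values agree; the second pair do not, since the product rule fails once the variates are dependent.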

1.16  Continuous Probability  In some problems we cannot divide the whole
space of possibilities $\mathcal{U}$ into a finite (or even a denumerable) set of regions $A_j$
in each of which $X$ takes a definite value $x_j$. What happens is that $X$ varies
continuously over some interval and there is a probability $f(x)\,dx$ that it takes a
value between $x$ and $x + dx$. In such a case $f(x)$ is called a probability density,
or simply a density. It is a single-valued, non-negative function, integrable over
its whole domain of definition (say from $x = a$ to $x = b$), and such that

$\int_a^b f(x)\,dx = 1$.

It is customary to take the domain of $x$ as the whole real axis from $-\infty$ to
$+\infty$, and to put $f(x) = 0$ over any intervals which correspond to impossible
values of $X$. It may happen, for instance, that $X$ is by nature non-negative, in
which case $f(x) = 0$ for all negative values of $x$.

If the event A corresponds to a value of x between $\alpha$ and $\beta$, the probability of
A is defined by

$$P(A) = \int_\alpha^\beta f(x)\,dx.$$

Note that here the function f(x), defined on the axis of real numbers, induces a
measure on $\mathcal{U}$, the measure of the interval dx being f(x) dx. The probability that
$X \le x$ is given by

(1.16.1)    $F(x) = \int_{-\infty}^{x} f(u)\,du$


This  function,  F(x),  is  called  the  cumulative  distribution  function  (often  simply  the 
distribution  function)  corresponding  to  the  density  f(x).  F(x)  is  a  non-negative, 
never-decreasing  function,  with  values  lying  between  0  and  1,  inclusive. 

For a discrete distribution, in which X takes the distinct values $x_j$ (j = 1, 2,
3, ...) with probabilities $f(x_j)$, the distribution function is defined as

(1.16.2)    $F(x) = \sum_{x_j \le x} f(x_j)$

This is a step-function (see Figure 9a). It increases by a finite amount $f(x_j)$ at
each point $x_j$, but remains constant in between. At each point $x_j$, F(x) has the
value at the top of the riser, and so is continuous on the right but not on the left.
Figure 9b shows a typical distribution function for a continuous variate.
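
The step-function behaviour of a discrete distribution function can be sketched in a few lines. The distribution used here (the score on a fair die) and the names `xs`, `fs` and `F` are assumptions chosen for illustration only.

```python
from bisect import bisect_right
from fractions import Fraction
from itertools import accumulate

# Hypothetical discrete distribution: the score on a fair die.
xs = [1, 2, 3, 4, 5, 6]              # distinct values x_j, in increasing order
fs = [Fraction(1, 6)] * 6            # probabilities f(x_j)
cumulative = list(accumulate(fs))    # running totals f(x_1) + ... + f(x_j)

def F(x):
    """Distribution function: the sum of f(x_j) over all x_j <= x."""
    j = bisect_right(xs, x)          # number of values x_j that are <= x
    return cumulative[j - 1] if j else Fraction(0)

print(F(0.5), F(1), F(1.5), F(2), F(6), F(10))   # 0 1/6 1/6 1/3 1 1
```

Note that F(1) already includes the jump at x = 1, while F(x) for x just below 1 is still 0; this is the continuity on the right described above.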


Fig. 9  Distribution function: (a) discrete, (b) continuous


Example  8  (Bertrand's  problem)  What  is  the  probability  that  if  a  chord  is 
drawn  at  random  in  a  circle  of  radius  a,  the  length  of  the  chord  will  be  greater 
than  a? 

This illustrates a difficulty that often arises in such problems. There is no
unique answer unless the words "at random" are more carefully defined.

Suppose we pick a point A anywhere on the circle and draw the diameter AOA'
through A. Any chord AB will make an angle $\theta$ with AA' which lies between
$-\pi/2$ and $\pi/2$. If this angle is between $-\pi/3$ and $\pi/3$ (so that AB is between AC'
and AC), the condition $l > a$ is satisfied, where $l$ is the length of the chord.
The probability required is therefore

$$\int_{-\pi/3}^{\pi/3} f(\theta)\,d\theta \Big/ \int_{-\pi/2}^{\pi/2} f(\theta)\,d\theta,$$

where $f(\theta)$ is the probability density for $\theta$. On the
assumption that all possible values of $\theta$ are equally likely, $f(\theta) = 1/\pi$, and the
probability reduces to 2/3.


Fig. 10  Bertrand's problem

Fig. 11  Buffon's problem


Another possible procedure for drawing the chord "at random" would be
to select A as before and pick a point D on AO. The perpendicular EF to AO
passing through D defines a chord of the circle. If OD = x, EF will be greater
than a whenever $x < \sqrt{3}\,a/2$. If the probability density for x is f(x), the required
probability is

$$\int_0^{\sqrt{3}\,a/2} f(x)\,dx \Big/ \int_0^a f(x)\,dx.$$

On the assumption that all possible values of x are equally likely, f(x) = 1/a,
and the probability reduces to $\sqrt{3}/2 = 0.866$.

The  two  answers  correspond  to  different  induced  measures,  depending  on  the 
way  we  conceive  the  chord  drawn  at  random.  It  would  not  be  possible  to  settle 
the  question  by  resort  to  experiment,  because  in  devising  any  experimental 
set-up  for  drawing  a  random  chord  (such  as  tossing  a  straight  piece  of  wire, 
longer  than  the  diameter  of  the  circle,  on  to  the  table-top  on  which  the  circle  is 
drawn)  we  would  need  to  choose  one  particular  interpretation  of  the  random 
process. 
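
The dependence of the answer on the meaning of "at random" can be illustrated by simulating the two procedures of Example 8. This is a sketch under stated assumptions: the radius is taken as a = 1, the pseudo-random generator of the Python standard library stands in for a physical random process, and the function names are hypothetical.

```python
import math
import random

def chord_by_angle(a, rng):
    """Procedure 1: fix A on the circle; theta is uniform on (-pi/2, pi/2)."""
    theta = rng.uniform(-math.pi / 2, math.pi / 2)
    return 2 * a * math.cos(theta)         # length of the chord AB

def chord_by_distance(a, rng):
    """Procedure 2: D lies on the radius AO, with OD = x uniform on (0, a)."""
    x = rng.uniform(0, a)
    return 2 * math.sqrt(a * a - x * x)    # length of the chord EF

def estimate(method, a=1.0, trials=200_000, seed=1):
    """Fraction of simulated chords whose length exceeds the radius a."""
    rng = random.Random(seed)
    hits = sum(method(a, rng) > a for _ in range(trials))
    return hits / trials

p_angle = estimate(chord_by_angle)
p_distance = estimate(chord_by_distance)
print(p_angle, p_distance)   # close to 2/3 and sqrt(3)/2 respectively
```

Each simulation simply builds in one interpretation of the random process, so it reproduces the corresponding answer rather than deciding between them.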

Example 9 (Buffon's problem)  Parallel straight lines, a distance a apart,
are ruled on a horizontal table. A straight piece of wire, or needle, of length
$l < a$, is tossed at random on to the table. What is the probability that it crosses
a line?

If we take the x-axis along one of the lines and the y-axis perpendicular, it
is easy to see that the x-coordinate of the centre of the needle is immaterial. It
is the y-coordinate of the centre and the angle $\theta$ made by the needle with the
x-axis which matter (Fig. 11).

The needle will cross the nearest line if the distance y of its centre from this
line is less than $\tfrac{1}{2} l \sin\theta$. The domain of y is from 0 to a/2 and that of $\theta$ from 0 to
$\pi$. If the joint probability density for $\theta$ and y is $f(\theta, y)$, the required probability is

$$P = \frac{\int_0^{\pi} \int_0^{\frac{1}{2} l \sin\theta} f(\theta, y)\,dy\,d\theta}{\int_0^{\pi} \int_0^{a/2} f(\theta, y)\,dy\,d\theta}.$$

The assumption of random tossing means that we regard all possible values
of $\theta$ and y as equally likely. If so, $f(\theta, y)$ is constant and the probability becomes

$$P = \frac{\int_0^{\pi} \int_0^{\frac{1}{2} l \sin\theta} dy\,d\theta}{\int_0^{\pi} \int_0^{a/2} dy\,d\theta} = \frac{l}{\pi a} \int_0^{\pi} \sin\theta\,d\theta = \frac{2l}{\pi a}.$$

This result suggests an empirical method of approximating to the value of $\pi$.
Various trials of the method have been made from time to time (see, for example,
reference [12]).
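
The empirical method can be imitated on a computer. The sketch below assumes l = 1 and a = 2 (arbitrary values with l < a) and uses hypothetical names; note that it draws the angle with the help of the library constant `math.pi`, so it illustrates the relation P = 2l/(πa) rather than providing an independent determination of π.

```python
import math
import random

def buffon_estimate(l=1.0, a=2.0, trials=400_000, seed=2):
    """Toss a needle of length l < a; return the observed crossing frequency."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(trials):
        y = rng.uniform(0, a / 2)          # distance of the centre from the nearest line
        theta = rng.uniform(0, math.pi)    # angle between the needle and the lines
        if y < 0.5 * l * math.sin(theta):
            crossings += 1
    return crossings / trials

l, a = 1.0, 2.0
p_hat = buffon_estimate(l, a)
pi_hat = 2 * l / (a * p_hat)               # invert P = 2l/(pi a)
print(p_hat, pi_hat)   # p_hat near 1/pi = 0.318..., pi_hat near 3.14
```

Even with several hundred thousand tosses the estimate of π is good to only about two decimal places, which indicates how slowly the method converges.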

1.17  Random Numbers  The importance of ensuring that a sample is
effectively random, if a valid inference is to be made from the sample to the
population, has already been mentioned. Various tests of randomness have been
devised and some will be mentioned later (see §§ 10.13 and 10.14). The choosing
of a random sample, even from an artificial population of cards, balls, discs, or
the like, is not easy, since the mechanical shuffling or mixing may be far from
adequate, cards may tend to stick together, balls may not be equally smooth, and
so on. When it is necessary to pick random samples from a crop in the ground
or from a group of experimental animals, the task is much harder. There is a
natural tendency to pick what seem to be typical rather than truly random
samples.

Experience  has  shown  that  the  best  method  is  to  use  a  set  of  random  numbers. 
These  numbers  have  usually  been  obtained  by  some  mechanical  process,  such 
as  a  very  carefully  made  roulette  wheel,  and  have  been  thoroughly  tested  for 
randomness.  They  are  generally  arranged  in  sets  of  two  or  four  digits  and 
grouped in thousands. A short extract from one such table [13] is given in the
Appendix, Table B.1. The largest table up to the present time contains a million
random  digits  [14]  and  was  prepared  because  of  the  increasing  need  for  very 
large  blocks  of  random  numbers  in  some  modern  statistical  techniques. 

Random numbers may be used to simulate the results of a chance
experiment, such as tossing a coin. Instead of actually tossing a real coin (assumed to
have a probability for heads equal to 0.5), one may open a table of random
numbers, jab a forefinger anywhere on the page and start reading random
numbers (in pairs) from the place so indicated. Whenever the number is
between 00 and 49 inclusive it is read as "head", and when it is between 50 and 99
inclusive as "tail". Thus the succession of numbers 55, 58, 79, 50, 56, 01, 51,
65, 92, 32, 21, 66, 35, 18, 65, 08, ... will be recorded as the following sequence
of heads and tails: T T T T T H T T T H H T H H T H .... This
arrangement, which the author obtained on his first trial, is a random sequence and
could easily happen with an actual coin, but it is not what we would write down
naturally if we tried to forecast the result of such an experiment. Although
random, it is hardly typical.
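
The reading rule just described is easily mechanized. In this sketch the function name `toss_from_pairs` is hypothetical; the numbers are those quoted in the text.

```python
def toss_from_pairs(pairs):
    """Read two-digit random numbers: 00-49 as head (H), 50-99 as tail (T)."""
    return "".join("H" if 0 <= n <= 49 else "T" for n in pairs)

# The sixteen two-digit numbers quoted above (01 and 08 written as 1 and 8).
numbers = [55, 58, 79, 50, 56, 1, 51, 65, 92, 32, 21, 66, 35, 18, 65, 8]
print(toss_from_pairs(numbers))   # TTTTTHTTTHHTHHTH
```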

In  order  to  use  random  numbers  to  draw  samples  from  a  given  population  it 
is  necessary  to  allot  numbers,  or  blocks  of  numbers,  to  elements  in  the  population. 
Thus  if  we  have  a  population  of  600  from  which  we  want  to  draw  a  sample  of  20, 
we  number  the  members  of  the  population  (the  data  for  each  individual  may  be 
entered  on  a  numbered  card,  for  example)  and  then  we  read  off  consecutive 
random  numbers  in  groups  of  three  digits,  ignoring  all  numbers  over  600  and 
disregarding  repetitions.  The  numbers  so  obtained,  such  as  284,  444,  323,  424, 
358,  521,  406,  565,  457,  078  ...  ,  represent  the  individuals  selected  for  the 
sample.  If  the  population  consists  of  several  classes,  the  members  of  any  one 
class  being  practically  alike,  and  we  want  a  sample  from  this  population,  we 
simply  have  to  know  how  many  members  (or  what  proportion  of  members)  there 
are in each class in the population, and allot a block of random numbers to each class.
The  size  of  the  block  should  be  proportional  to  the  number  of  members  in  the 
class.  Every  random  number  that  belongs  to  a  particular  block  indicates  a 
member  of  that  class  drawn  for  the  sample.  Thus,  if  there  are  five  classes  in  the 
population,  say  A,  B,  C,  D,  E,  numbering  respectively  80,  200,  450,  240  and  30 
individuals,  we  can  allot  blocks  of  four-digit  random  numbers  as  follows: 
A (0000 to 0799), B (0800 to 2799), C (2800 to 7299), D (7300 to 9699), E (9700 to
9999). The size of each block of numbers (800 for A, 2000 for B, etc.) is
proportional to the size of the corresponding class in the population. The following
set of random numbers: 6469, 7152, 0256, 6137, 0458, 0968, 9610, 5778, 8500,
8981, would indicate a sample C, C, A, C, A, B, D, C, D, D, consisting of
two A's, one B, four C's and three D's.
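
Both procedures of this paragraph can be sketched as follows. The function names are hypothetical, and treating 000 as a number to be ignored (so that the members are numbered 1 to 600) is an assumption; the text says only that numbers over 600 are ignored. The values 601 and the repeated 284 in the first call are inserted purely to show the skipping rule.

```python
from bisect import bisect_right

def simple_sample(numbers, population_size, sample_size):
    """Keep numbers in 1..population_size, skipping repetitions."""
    chosen = []
    for n in numbers:
        if 1 <= n <= population_size and n not in chosen:
            chosen.append(n)
        if len(chosen) == sample_size:
            break
    return chosen

# Blocks of four-digit numbers proportional to the class sizes 80, 200, 450, 240, 30.
classes = ["A", "B", "C", "D", "E"]
block_starts = [0, 800, 2800, 7300, 9700]   # first number of each block

def class_of(number):
    """Return the class whose block contains the given four-digit number."""
    return classes[bisect_right(block_starts, number) - 1]

print(simple_sample([284, 444, 323, 601, 284, 424], 600, 4))
# [284, 444, 323, 424]

draws = [6469, 7152, 256, 6137, 458, 968, 9610, 5778, 8500, 8981]
print([class_of(n) for n in draws])
# ['C', 'C', 'A', 'C', 'A', 'B', 'D', 'C', 'D', 'D']
```

Storing only the first number of each block and locating a draw by binary search keeps the allotment of blocks in one place, so a change in the class sizes requires changing a single list.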

Random  numbers  may  also  be  used  in  many  other  ways.  One  common 
requirement  in  experimental  work  is  to  randomize  the  order  of  a  group  of  objects, 
such  as  plots  of  land  in  a  block,  where  the  plots  are  to  have  different  treatments. 
To  randomize  nine  objects,  or  in  other  words,  to  form  a  random  permutation  of 
the  integers  1  to  9,  we  can  read  off  any  set  of  consecutive  random  digits,  ignoring 
zeros and repetitions. Thus, the set 3 4 7 6 6 4 5 5 6 6 4 9 0 1 5 6 6 3 6 8 8 0 2
gives the order 3 4 7 6 5 9 1 8 2. Modifications of this method can be used with
groups of larger size. The important thing in randomization is to use an
impersonal, objective method and not to trust to intuition.
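
The rule for forming a random permutation of the integers 1 to 9 from a run of random digits can be written out as a short routine. The function name is hypothetical; the digit string is the one quoted in the text.

```python
def permutation_from_digits(digits, n=9):
    """Read digits in order, ignoring zeros and repetitions, until 1..n all appear."""
    order = []
    for ch in digits:
        d = int(ch)
        if 1 <= d <= n and d not in order:
            order.append(d)
        if len(order) == n:
            break
    return order

print(permutation_from_digits("34766455664901566368802"))
# [3, 4, 7, 6, 5, 9, 1, 8, 2]
```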

Although Table B.1 is known to satisfy several tests of randomness, any
limited collection of random numbers is bound to show some peculiarities.
Accordingly, the table should not be used over and over again. If very large
blocks of random numbers are needed, recourse should be had to larger tables
such as [13] and [14].

PROBLEMS

A  (§§  1.7-1.10) 

1.  Suppose  that  of  a  group  of  people  surveyed,  30%  own  both  a  house  and  a  car, 
40%  own  a  car  but  not  a  house,  10%  a  house  but  not  a  car,  and  20%  own  neither. 
Illustrate  by  a  Venn  diagram.  What  is  the  percentage  that  own  either  a  house  or  a  car 
or  both  ?  What  is  the  fraction  of  car  owners  that  are  also  house  owners  ? 

2.  Let  A  stand  for  the  event  that  a  man  chosen  at  random  from  a  certain  population 
is  overweight  and  B  for  the  event  that  he  is  over  50.  Write  down  the  symbols  for  the 
probabilities  that  (a)  he  is  not  overweight,  (b)  if  he  is  overweight  he  is  also  over  50, 
(c)  if  over  50  he  is  not  also  overweight. 

3.  If P(A) = i, P(B) = i and $P(A \cup B) = 11/12$, what is $P(A \cap B)$? Find also $P(A \mid B)$
and $P(B \mid A)$.

4.  If A and B are independent events, and if P(A) = I and P(B) = f, what is
$P(A \cup B)$? Hint: Use Eq. (1.10.4).

5.  Prove from the definition of conditional probability that
$P(A \cap B \cap C) = P(A) \cdot P(B \mid A) \cdot P(C \mid A \cap B)$.

6.  Write out the detailed proof of Eq. (1.7.11).

7.  Two good dice are rolled simultaneously. Let A denote the event "the sum shown
is 8" and B the event "the two show the same number." Find P(A), P(B), $P(A \cap B)$,
$P(A \cup B)$, $P(A \mid B)$ and $P(B \mid A)$.

B  (§1.11) 

1.  How  many  five-digit  numbers  are  there  with  every  digit  odd?  How  many  with 
no  digit  lower  than  6? 

2.  How  many  arrangements  can  be  made  of  the  letters  of  the  word  "caught"  if  the 
vowels  are  always  together  and  in  the  same  order? 

3.  Four  strangers  board  a  bus  in  which  there  are  six  empty  seats.  In  how  many 
different  ways  can  they  be  seated  ? 

4.  Six  papers  are  set  in  an  examination,  two  of  them  in  mathematics.  In  how  many 
different  orders  can  the  papers  be  given  if  the  two  mathematics  papers  are  not  to  be 
successive? 

5.  Show that the number of ways in which p positive and n negative signs (p > 0,
0 < n < p + 1) may be placed in a row so that no two negative signs shall be together
is $\binom{p+1}{n}$. Hint: With the positive signs placed in a row, there are p + 1 possible
places for the first negative sign, p for the second, and so on. The negative signs are
not distinguishable.

6.  Prove that $\binom{n}{r} = \binom{n}{n-r}$.

7.  Prove that $r\binom{n}{r} = n\binom{n-1}{r-1}$. Hence, show that $\sum_{r=1}^{n} r\binom{n}{r} = n\,2^{n-1}$.
Hint: $\sum_{r=1}^{n} \binom{n-1}{r-1} = \sum_{r=0}^{n-1} \binom{n-1}{r}$. Put x = 1 in Theorem 1.13.

8.  If("2)  =  (J),  what  is /i? 

9.  At a long dinner table the host and hostess sit opposite each other at the ends.
In how many ways can 2n guests be arranged (n on a side) so that two particular guests
do not sit together? Hint: Place these two guests first.

10.  Prove that $\sum_{r=0}^{k} \binom{m}{r}\binom{n}{k-r} = \binom{m+n}{k}$, k < m. Hint: Use Theorem 1.13 and
the identity $(1 + x)^{m+n} = (1 + x)^m (1 + x)^n$.

11.  Use Problem 10 to prove that


m 


C  (§  1.12) 

1.  If  four  cards  are  drawn  at  random  from  a  deck  of  52  cards,  what  is  the  probability 
that  there  will  be  one  card  of  each  suit  ? 

2.  A  bag  contains  nine  white  and  three  black  balls.  If  five  balls  are  drawn,  without 
replacement,  what  is  the  probability  that  at  least  two  are  black  ? 

3.  What  is  the  chance  that  a  bridge  hand  of  13  cards  contains  both  the  ace  and  king 
of  spades  ? 

4.  A batch of 1,000 lamps is known to have 5% defectives. If five lamps are chosen
at  random  and  tested,  what  are  the  probabilities  that  (a)  none  is  defective  (b)  there  are 
exactly  two  defectives? 

5.  A  factory  produces  screws,  put  up  in  boxes  of  100.  Boxes  are  inspected  by  taking 
20  screws  at  random  and  rejecting  the  box  if  any  defects  are  found  in  the  sample. 
What  is  the  probability  of  passing  a  box  that  contains  two  defective  screws  ? 

6.  Calculate  the  probability  of  throwing  a  6  with  an  ordinary  die  at  least  once  in 
six  trials. 

7.  A  room  has  three  lamp  sockets.  From  a  collection  of  10  light  bulbs,  of  which 
only  six  are  good,  I  select  three  at  random  and  put  them  in  the  sockets.  What  is  the 
probability  that  I  shall  have  light?  Hint:  Find  the  probability  of  not  getting  light,  i.e., 
of  selecting  three  bad  bulbs. 

8.  A and B take turns in throwing two dice, the winner being the first to throw 9.
Show that if A has the first throw, their respective chances of winning are in the ratio 9/8.
Hint: A may win on 1st, 3rd, 5th, ... throws, and these possibilities are mutually exclusive.

9.  Cards are dealt from a well-shuffled deck until an ace appears. Show that the
probability that exactly n cards will be dealt before the first ace is
$4\,(48)!\,(51 - n)! / [52!\,(48 - n)!]$.

10.  (The matching problem). A man writes four letters and addresses four envelopes.
His secretary puts the letters in the envelopes at random. Show that the probability that
at least one letter gets into its right envelope is $1 - 1/(2!) + 1/(3!) - 1/(4!)$. Generalize
for n letters. Hint: Use Eq. (1.7.12). Let $A_j$ denote the event that the jth letter gets into the
right envelope. The probability required is $P(A_1 \cup A_2 \cup A_3 \cup A_4)$. For n letters the
result is close to $1 - e^{-1} = 0.632$ (see Appendix A.1). The approximation is correct
to the third figure for n > 6.

11.  In  a  gambling  game,  a  player  may  deal  10  cards  from  a  well-shuffled  bridge 
deck  and  wins  if,  at  any  stage  of  the  dealing,  the  number  on  a  card  is  the  same  as  the 
number  of  cards  dealt.  (Face  cards  are  assigned  the  number  zero).  Find  the  probability 
that  the  dealer  will  win.  Hint:  This  is  a  slightly  more  general  matching  problem  (see 
Hint to Problem 10). Show that $P(A_j) = 4 \cdot 51!/52!$, $P(A_j \cap A_k) = 4^2 \cdot 50!/52!$, etc.

12.  Ten  absent-minded  professors,  each  with  a  hat,  attend  a  meeting  and  each  man 
leaves  with  one  of  these  hats  chosen  at  random.  What  is  the  approximate  probability 
that  no  one  gets  his  own  hat?  What  is  the  probability  that  exactly  nine  men  get  their 
own  hats  ? 

13.  A bridge player and his partner have nine spades between them. What are the
respective probabilities that the other four spades are split between the opponents 4-0,
3-1, 2-2?

14.  Twelve cards have been dealt, six down and the other six showing a jack, two
kings, a 7, a 5 and a 4. What is the probability that the next card will be a 4 or less, ace
counting low? Hint: The six cards down do not affect the answer.

15.  From an urn containing 10 balls, numbered from 1 to 10, balls are drawn one
at a time and placed in a straight row of holes also numbered 1 to 10. If each ball is
placed in its proper hole, what is the probability that there will not be an empty hole
between two filled ones at any time of the drawing? Hint: When k holes are filled,
there are two favorable positions for the next ball, unless the k holes include an end
hole, when there is only one favorable position. Multiply the probabilities of success
for k = 1 to 9.

16.  Prove that the probability that some one of the four hands of cards in a
particular bridge deal contains all 13 cards of a suit is about 1 in 40 thousand millions. (More
hands of this character have been reported than would be expected. This fact may be
due to imperfect shuffling in actual play.)

D  (§§1.13-1.15) 

1.  A  bag  contains  five  nickels  and  a  quarter,  all  wrapped  separately  so  as  to  be 
indistinguishable.  A  boy  is  allowed  to  draw  one  coin  at  a  time  and  keep  it  until  he 
draws  the  quarter,  when  he  must  stop.  What  is  his  expectation? 

2.  A  tosses  six  pennies  and  agrees  to  pay  B  $6  if  either  six  heads  or  six  tails  appear 
and  $5  if  either  five  heads  or  five  tails  appear.  In  every  other  case  he  takes  B's  stake. 
How  much  should  this  stake  be  to  make  the  game  fair?  Hint:  If  the  stake  is  $x, 
calculate  B's  expectation  and  put  it  equal  to  zero. 

3.  A coin is tossed until a head appears and the number of tails obtained is recorded.
Find the probability of getting x tails before the first head, and the expected value of x.
Hint: $1 + 2r + 3r^2 + \dots = (1 - r)^{-2}$, $|r| < 1$.

4.  In  an  infinite  series  of  independent  trials  of  an  event  with  constant  probability  p 
of  success  in  a  single  trial,  what  is  the  expectation  of  the  number  of  failures  preceding  the 
first  success  ? 

5.  From  a  deck  of  13  spades  a  person  draws  cards  one  at  a  time,  replacing  each 
time,  until  he  draws  the  ace.  What  is  the  expectation  of  the  number  of  cards  drawn  ? 

6.  What  is  the  mathematical  expectation  of  the  sum  of  points  on  n  dice,  tossed  at 
the  same  time? 

7.  A tosses 3 pennies and B two, and the winner is the one with the greatest number
of heads. The winner takes the combined stakes. If there is a tie they continue tossing
until a decision is reached. How much money should A put up on a game, to each
dollar that B puts up, to make the game fair? (A game is a set of tosses leading to a
decision. Theoretically, a game might go on indefinitely, but the probability of this
is zero.)

8.  (The Petersburg Paradox). A tosses a coin repeatedly, having agreed to give B
$\$2^n$ if n tails appear before the first head. (Thus B receives $1 if the first toss is a head,
$2 if one tail precedes a head, and so on). If, however, 10 tails appear in succession
before a head, the game stops there and B receives $\$2^{10}$. What sum should B pay A for
this privilege? Note: The paradox arises from the fact that if the game is allowed to go
on indefinitely, B's expectation is infinite. This seems contrary to common sense and
has been the subject of much discussion. Even in the limited game, B would be foolish
to pay a sum equal to his expectation, unless he intends to play the game a few thousand
times. See, e.g., [3] and [9].

9.  Five  cards  are  drawn  at  random  from  a  deck  without  replacement,  looked  at, 
and  then  replaced.  This  is  done  1,000  times.  How  often  would  you  expect  to  get: 
(a)  5  of  one  suit,  (b)  4  of  one  suit,  (c)  3  of  one  suit,  2  different,  (d)  3  of  one  suit,  2  of  another, 
(e)  2  of  each  of  two  suits,  1  different,  (f)  2  of  one  suit,  3  different?  Hint:  The  expected 
number in each case is 1,000 times the probability of that combination.

E  (§  1.16) 

1.  Let $A_1$, $A_2$, $A_3$, $A_4$, $A_5$ be sub-sets of the two-dimensional x-y plane, defined
as follows:

$A_1$ = set of (x, y) such that $x \le 2$, $y \le 4$, denoted by $\{x \le 2,\ y \le 4\}$,

$A_2 = \{x \le 2,\ y \le 1\}$,    $A_3 = \{x \le 0,\ y \le 4\}$,

$A_4 = \{x \le 0,\ y \le 1\}$,    $A_5 = \{0 < x \le 2,\ 1 < y \le 4\}$.

Given that $P(A_1) = 7/8$, $P(A_2) = 1/2$, $P(A_3) = 3/8$, and $P(A_4) = 1/4$, find $P(A_5)$.
Illustrate by a diagram.

2.  A  circle  of  diameter  8  in.  is  drawn  in  the  interior  of  a  square  of  side  12  in.  A 
penny  (diameter  |  in.)  is  dropped  at  random  on  the  square,  which  is  lying  on  a  hori- 
zontal table.  If  only  those  cases  are  counted  when  the  penny  lies  wholly  inside  the 
square,  what  is  the  probability  that  it  is  also  wholly  inside  the  circle  ?  Hint:  The  center 
of  the  penny  is  equally  likely  to  fall  anywhere  within  the  area  open  to  it.  Calculate  the 
possible  area  and  the  favorable  area. 

3.  The  floor  of  a  large  room  is  made  of  hardwood,  laid  in  strips  1  in.  wide,  with 
cracks  between  of  negligible  width.  A  coin  of  diameter  \\  in.  is  dropped  on  the  floor. 
What  is  the  probability  that  it  touches  three  strips  ? 

4.  A third method of drawing a chord "at random" in a circle (see Example 8
above, § 1.16) is to select the center of the chord at random and then draw the chord.
When the center is determined, so is the whole chord. If the center is equally likely to
be anywhere within the given circle, show that the answer to Bertrand's problem is 3/4.

5.  A thin stick of length a is broken into three pieces. What is the probability that
these pieces can be arranged to form a triangle? Hint: No piece may be longer than
a/2. If x, y, z are the three lengths, they satisfy the condition x + y + z = a, which
represents the part of a plane contained in the positive octant. Find the area of the
part of this plane corresponding to the given condition.

6.  A diamond of value V is broken into two pieces. If the value of a diamond
varies as the square of its weight, what is the expected value of the broken diamond?
Hint: If w is the original weight, the probability that one piece has a weight x to
x + dx is dx/w, 0 < x < w.

Chapter 2

FREQUENCY   DISTRIBUTIONS,  FRACTILES 
AND  MOMENTS 


2.1  Frequency  Distribution  in  a  Sample  To  facilitate  applications  of 
statistical  inference,  the  data  supplied  by  observation  or  experiment  are  usually 
organized  and  summarized  to  expose  their  essential  characteristics.  Some  of 
the  methods  and  techniques  for  extracting  the  essential  information  supplied  by 
data,  which  are  to  be  regarded  as  constituting  a  sample,  will  be  presented  in  this 
chapter. Incidentally, we might point out that statistical inference is not
concerned solely with data that have already been obtained. An important branch
of  modern  statistical  theory  deals  with  the  design  of  experiments  and  shows  the 
experimenter  how  to  arrange  his  work  so  as  to  be  able  subsequently  to  extract 
the  maximum  of  information  from  a  limited  amount  of  data. 

Data obtained in an experiment, or as the result of an enquiry or
questionnaire, are often presented in a table. A common form is the frequency table, in
which one column gives observed values x of a random variable X and the other
gives the frequency with which each of these values was obtained. Recall that X
is a function on the space of events to the real axis. Its domain, if discrete, is the
set of events $A_j$ and its range is the set of distinct real numbers $x_j$. If X is
continuous, its range usually includes all real numbers in some interval. In this
case the range is divided up into convenient sub-intervals and the frequencies
corresponding to the various sub-intervals (or classes) are entered in the table.

Table 2.1 records some data obtained by D. A. S. Fraser [1] in tossing a
crudely made plastic die. The variable X is here the number of spots observed,
and is of course discrete. The frequencies f of the six values observed in the first
400 tosses are given. The third column will be referred to later.

Table 2.1

    X        f        F

    1       73       73
    2       83      156
    3       80      236
    4       57      293
    5       41      334
    6       66      400

  Total    400

Table 2.2 similarly presents a distribution of weights, measured to the
nearest pound, for a sample of 1000 eight-year-old girls in Glasgow. The weight
may be regarded as a continuous variable over a range of about 40 lb, divided
into 10 sub-intervals, each of width 4 lb. Since a measured weight of 39 lb
would be recorded for any child between 38.5 and 39.5 lb, the real limits for these
sub-intervals (often called the class boundaries) are 27.5 lb, 31.5 lb, 35.5 lb, and
so on. The central values of the sub-intervals are called class-marks and will be
denoted by $x_c$. The upper class boundaries (ends of sub-intervals) will be denoted
by $x_e$.

Table 2.2

Measured Class   Class-Mark   Frequency   Upper Class Boundary   Cumulative
Limits           ($x_c$)      ($f$)       ($x_e$)                Frequency $F$

28-31 lb         29.5 lb          1       31.5 lb                     1
32-35            33.5            14       35.5                       15
36-39            37.5            56       39.5                       71
40-43            41.5           172       43.5                      243
44-47            45.5           245       47.5                      488
48-51            49.5           263       51.5                      751
52-55            53.5           156       55.5                      907
56-59            57.5            67       59.5                      974
60-63            61.5            23       63.5                      997
64-67            65.5             3       67.5                     1000

Total                          1000

2.2  Cumulative Frequency Distributions  It is often convenient to present the
data of a frequency table in a slightly different form, recording for suitable values
of x the total number of items in the sample which have an observed X equal to
or less than x. For the discrete distribution of Table 2.1, the third column gives
these cumulative frequencies (accumulated by adding the ordinary frequencies
one by one from the top of the column downwards). They are denoted by F.

For a grouped distribution like that of Table 2.2, it is usually convenient to
choose the upper class boundaries $x_e$ as the selected values of x corresponding to
F. There will be no measured values actually coinciding with $x_e$ (since the
measured values are all recorded to the nearest unit and the $x_e$ to half a unit) and
therefore the cumulative frequency gives the number of items with X less than $x_e$.
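
The accumulation described in this section amounts to a running sum. As a brief sketch (the variable names are hypothetical; the frequencies are those of Tables 2.1 and 2.2):

```python
from itertools import accumulate

# Frequencies f from Table 2.1 (die scores 1 to 6) and Table 2.2 (weight classes).
f_die = [73, 83, 80, 57, 41, 66]
f_weight = [1, 14, 56, 172, 245, 263, 156, 67, 23, 3]

# Upper class boundaries of Table 2.2: 31.5, 35.5, ..., 67.5 lb.
upper_boundaries = [31.5 + 4 * i for i in range(10)]

F_die = list(accumulate(f_die))
F_weight = list(accumulate(f_weight))

print(F_die)      # [73, 156, 236, 293, 334, 400]
print(F_weight)   # [1, 15, 71, 243, 488, 751, 907, 974, 997, 1000]
print(list(zip(upper_boundaries, F_weight))[:3])
# [(31.5, 1), (35.5, 15), (39.5, 71)]
```

The last cumulative frequency in each case equals the sample size, which provides a simple check on the arithmetic.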


2.3  Graphical Representation  A frequency table for a discrete distribution
may be represented graphically by drawing ordinates equal to the frequency on a
convenient scale at the various values of x. Thus Figure 12 corresponds to
Table 2.1. The tops of the ordinates may be joined by straight lines, but these
are merely to assist the eye and have no significance at intermediate values of x.

Fig. 12  Frequency graph for discrete variate

A  continuous  distribution,  grouped  in  classes,  is  usually  represented 
graphically  by  a  histogram,  as  in  Figure  13  (for  the  data  of  Table  2.2).  The 
rectangles  are  drawn  with  bases  corresponding  to  the  true  class  intervals  and 
with  heights  proportional  to  the  frequencies.  With  all  the  class  intervals  equal, 
as  in  this  example,  the  areas  of  the  rectangles  also  represent  the  corresponding 
frequencies.  In  some  tables  the  class  intervals  are  not  all  equal,  and  then  the 
heights must be suitably adjusted to make the areas proportional to the
frequencies.

If  the  mid-points  of  the  tops  of  the  rectangles  are  joined  by  straight  lines,  the 
result  is  a  frequency  polygon,  which  may  also  be  regarded  as  representing  the 
data. The frequency polygon for Table 2.2 is shown dotted in Figure 13.

Fig. 13  Histogram and frequency polygon for continuous variate

(a)    Discrete  variate  (b)    Continuous  variate 

Fig.  14    Cumulative  frequency  polygons 

Graphs  of  the  cumulative  frequencies  in  Tables  2.1  and  2.2  are  shown  in 
Figure  14.  The  first  is  a  step-diagram,  similar  to  Figure  9a,  in  which  at  each 
value  of  x  the  frequency  of  values  equal  to  or  less  than  x  is  plotted.  In  Figure 
14b  there  are  no  measured  values  equal  to  the  values  at  the  ends  of  the  intervals, 
and  the  plotted  points  represent  the  frequency  of  values  less  than  x. 

2.4  Frequency  Curves  and  Ogives  The  observed  data  usually  refer  to  a 
sample  of  finite  size  which  may  be  considered  as  representative  of  a  very  large,  or 
practically infinite, population. The data of Table 2.1 refer to a sample of 400
tosses  out  of  the  indefinitely  large  number  of  tosses  which  could  conceivably  be 
made  with  this  particular  die,  given  unlimited  time  and  patience,  before  the  die 
finally  wears  out.  The  population  of  eight-year-old  Glasgow  schoolgirls,  from 
which  the  sample  of  Table  2.2  was  taken,  is  not  infinite  but  is  certainly  large. 
With  a  very  large  population  and  a  continuous  variate,  we  can  imagine  the  class 
intervals  as  being  very  short  while  still  containing  many  observed  values  in  each 
class  (we  must  suppose  that  the  measurements  are  correspondingly  accurate). 
The  frequency  polygon  will  then  approximate  to  a  smooth  frequency  curve  which 
represents  the  distribution  of  the  variable  X  in  the  population.  The  area  under 
this  curve,  between  two  fixed  values  a  and  b,  represents  the  total  number  of 
individuals  in  the  population  with  values  of  X  between  a  and  b.  If  instead  of  the 
total frequency we consider the curve as giving the relative frequency (the
proportion of values in this interval), the curve then represents the probability
density in the population. The area between X = a and X = b is the probability
that  a  random  item  from  the  population  has  a  value  between  a  and  b. 

One  task  of  practical  statistics  is  to  determine  whether  a  given  sample  can 
reasonably  be  regarded  as  coming  from  a  population  of  a  particular  kind. 
Usually  some  plausible  assumption  is  made  about  the  form  of  the  frequency  curve 
(or  probability  density  curve)  for  the  population,  and  parameters  defining  the 
exact  shape  and  position  of  the  curve  are  calculated  so  as  to  make  it  fit  the 
frequency  polygon  for  the  sample  as  well  as  possible.  Some  test  is  then  applied  to 
find  out  whether  the  fit  can  reasonably  be  regarded  as  satisfactory.  This  process 
is  called  curve-fitting,  and  some  illustrations  will  be  given  later. 

Just  as  the  frequency  polygon  for  a  very  large  sample,  with  small  class 
intervals,  approximates  to  a  smooth  frequency  curve,  so  the  cumulative  fre- 
quency polygon  for  such  a  sample  approximates  to  a  smooth  curve  called  an 
ogive.  If  relative  frequencies  are  used,  the  ogive  becomes  identical  with  the  graph 
of  the  distribution  function  for  the  variate  X  in  the  population  (see  Fig.  9b). 

For  a  discrete  distribution,  the  relative  frequency  corresponding  to  a 
particular  value  of  X  is  an  approximation  to  the  actual  probability  of  this  value 
of  X  in  the  population  sampled.  If  some  prior  hypothesis  is  made  about  these 
probabilities  it  may  be  tested  by  noting  the  agreement  of  the  observed  relative 
frequencies  with  the  predicted  values.  We  might,  for  example,  use  the  data  of 
Table 2.1 to test the hypothesis that all faces of this particular die are equally
likely  to  turn  up  (see  §  10.2). 

2.5  The Median and Other Fractiles  The median of a sample of size N is the
value  of  x  for  which  the  cumulative  frequency  is  equal  to  N/2.  In  other  words,  it 
is  that  value  of  x  which  is  exceeded  by  half  the  members  of  the  sample.  For  this 
reason  it  is  often  used  as  an  average,  that  is,  a  single  (more  or  less  central)  value 
which  may  be  regarded  as  in  some  sense  typical  of  the  whole  sample.  For  a 
small  sample  the  median  is  the  middle  one  when  the  items  are  arranged  in  order 
(if  the  number  of  items  is  even,  the  median  is  usually  taken  half-way  between  the 
two  middle  ones). 

The median of a population with distribution function F(x) is that value of x for
which F(x) = 0.5. The median is therefore easily marked on an ogive or on a
cumulative frequency polygon. In Fig. 14b, the median is the abscissa of the point on the
polygon with ordinate N/2 = 500. If this point lies (as it usually will) on one
of the straight sides of the polygon, the abscissa may be calculated by linear
interpolation between the values at the beginning and end of this side. Thus in
Table 2.2, N/2 = 500. The value 488 of F corresponds to an x of 47.5 lb and the
value 751 to an x of 51.5 lb. The value of x corresponding to F = 500 is therefore

$47.5 + \frac{500 - 488}{751 - 488} \times 4 = 47.68$ lb.

The assumption underlying this computation is that the items of the sample in any
class may be regarded as having values of x which are distributed approximately
uniformly over the whole class interval.

For a discrete distribution, such as that of Table 2.1, the median is not a very
precise quantity. It is obvious that the median is 3, but there are 80 individual values all
equal to 3 and only 156 out of 400 are definitely less than 3.
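
The linear interpolation used for the grouped data of Table 2.2 can be written as a small helper. This is a minimal sketch; the function name `interpolate` and its argument names are hypothetical, and the numbers are those quoted in the text.

```python
def interpolate(x_lo, F_lo, x_hi, F_hi, target):
    """Abscissa at which the straight side joining (x_lo, F_lo) and
    (x_hi, F_hi) of the cumulative frequency polygon reaches `target`."""
    return x_lo + (target - F_lo) / (F_hi - F_lo) * (x_hi - x_lo)


# F = 488 at x = 47.5 lb, F = 751 at x = 51.5 lb, and N/2 = 500.
median = interpolate(47.5, 488, 51.5, 751, 500)
print(round(median, 2))  # 47.68
```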

The values of x which correspond to cumulative frequencies of N/4 and
3N/4 are called the first quartile and the third quartile respectively (often denoted
by $Q_1$ and $Q_3$). One quarter of the sample is below $Q_1$ and three quarters below
$Q_3$. Similarly, deciles (corresponding to tenths) and percentiles (corresponding to
hundredths) may be defined. The third decile, $D_3$, for instance, is the value of x
corresponding to a cumulative frequency of 3N/10, and might equally well be called
the thirtieth percentile, $P_{30}$. Since these points correspond to certain specified
fractions of the distribution they are collectively called fractiles (or sometimes
quantiles).

In  general,  the  kth  percentile  (Pk)  corresponds  to  a  cumulative  frequency 
equal  to  k%  of  N.  The  number  k  is  called  the  percentile  rank  of  Pk.  It  may  be 
calculated  by  interpolation  from  a  table  such  as  Table  2.2. 

Fractiles  are  often  computed  in  order  to  obtain  a  measure  of  the  spread  or 
dispersion  of  a  distribution.  The  more  a  distribution  is  concentrated  around  a 
central value, the less as a rule will be the distance between $Q_1$ and $Q_3$, or
between say $P_7$ and $P_{93}$, so that the differences $Q_3 - Q_1$ or $P_{93} - P_7$ may be
taken as measuring the dispersion. Deciles are often used by psychologists in
assessing  the  performance  of  a  student  on  some  aptitude  or  achievement  test. 
If,  for  example,  a  certain  student's  score  is  known  to  lie  between  D8  and  D9  as 
determined  from  a  large  group  taking  the  test,  we  can  say  that  this  student  is 
better  than  eight-tenths  of  the  group  but  is  not  in  the  top  tenth,  on  this  particular 
test. 
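
Any fractile of a grouped distribution can be found by the same interpolation as the median. The sketch below assumes the data of Table 2.2 as tabulated in Table 2.4 and a uniform spread of values within each class; the function name `grouped_fractile` is hypothetical.

```python
# Table 2.2 as tabulated in Table 2.4: class-marks (lb) and frequencies.
class_marks = [29.5, 33.5, 37.5, 41.5, 45.5, 49.5, 53.5, 57.5, 61.5, 65.5]
class_freqs = [1, 14, 56, 172, 245, 263, 156, 67, 23, 3]
CLASS_WIDTH = 4.0


def grouped_fractile(fraction, marks, freqs, width):
    """Interpolated x whose cumulative frequency equals fraction * N."""
    target = fraction * sum(freqs)
    below = 0  # cumulative frequency at the lower boundary of the class
    for mark, f in zip(marks, freqs):
        if f > 0 and below + f >= target:
            lower = mark - width / 2
            return lower + (target - below) / f * width
        below += f
    raise ValueError("fraction must lie between 0 and 1")


q1 = grouped_fractile(0.25, class_marks, class_freqs, CLASS_WIDTH)
q3 = grouped_fractile(0.75, class_marks, class_freqs, CLASS_WIDTH)
d3 = grouped_fractile(0.30, class_marks, class_freqs, CLASS_WIDTH)
iqr = q3 - q1  # interquartile range, a measure of dispersion
print(round(q1, 2), round(q3, 2), round(iqr, 2), round(d3, 2))
# 43.61 51.48 7.87 44.43
```

Calling the same function with a fraction of 0.5 gives the median of 47.68 lb found in § 2.5.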

2.6  Fractiles  as  Statistics  The  median  and  the  other  fractiles  belong  to  the 
class  of  "statistics,"  which  in  this  sense  (as  a  plural  word)  means  quantities 
calculated  from  experimental  or  observational  data  for  a  sample  and  used  to  make 
estimates  about  the  population  from  which  the  sample  is  drawn.  The  median  of 
a  sample  is  one  statistic  which  gives  information  about  the  population,  and  the 
interquartile range, $Q_3 - Q_1$, is another. However, there are some statistics
which are better than others for the purpose of giving reliable information. As
an average, the median has the disadvantage that it does not use all the data
available in the sample. It depends only on the order of the observations and not
directly on their actual size. The median of the numbers 1, 7, 11, 12, 19, 26, and
34  is  12,  since  this  is  the  middle  number,  but  any  other  set  of  seven  numbers 
arranged  in  ascending  order  with  12  in  the  middle  would  have  the  same  median. 

Statistics  differ  also  in  the  extent  of  their  sampling  fluctuations,  that  is,  in  the 
extent  to  which  their  numerical  values  vary  from  one  sample  to  another,  of  the 
same  size  and  drawn  from  the  same  population.  Other  things  being  equal,  a 
statistic  with  the  least  possible  sampling  fluctuation  will  be  preferred.  It  turns 
out  that  in  most  situations  the  median  has  a  greater  sampling  fluctuation  than  the 
arithmetic mean, which belongs to the class of statistics known as moments, and
in general the arithmetic mean will be the statistician's preferred average. The
mean  has  the  further  advantage  that  it  lends  itself  more  readily  than  the  median 
to  mathematical  manipulation.  A  fuller  discussion  of  the  relative  merits  of 
several  different  types  of  average  may  be  found,  for  example,  in  [2]. 
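
The difference in sampling fluctuation can be illustrated by a small simulation. The sketch below is not taken from the text: it assumes, purely for illustration, a normal population with mean 0 and standard deviation 1, samples of size 25, and 2000 repeated samples, and it compares the variance of the sample means with that of the sample medians.

```python
import random
import statistics

rng = random.Random(12345)  # fixed seed so that the run is repeatable
SAMPLE_SIZE = 25            # assumed sample size
REPLICATES = 2000           # assumed number of repeated samples

means, medians = [], []
for _ in range(REPLICATES):
    sample = [rng.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

# Spread of each statistic from sample to sample.
var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)
print(var_mean, var_median, var_median / var_mean)
```

Under these assumptions the variance of the median comes out noticeably larger than that of the mean; for other populations the comparison may differ.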


2.7  Moments  Suppose that a discrete variate X can take k distinct values
$x_i$ (i = 1, 2, ..., k), and that $f_i$ individuals in a sample have the value $x_i$. The
total size of the sample is $\sum f_i = N$. The rth moment of X about zero is defined by

(2.7.1)  $m'_r = N^{-1} \sum_i f_i x_i^r$

Obviously $m'_0 = 1$. The most important case is when r = 1; the statistic $m'_1$ is
called the arithmetic mean, and it will be convenient to denote it simply by m.
The notation $\bar{X}$ is also commonly used, and serves to indicate that the arithmetic
mean is a quantity of the same physical nature as X. If X is a length measured in
centimeters, then $\bar{X}$ will also be a length in centimeters.

To calculate m for a distribution such as that of Table 2.1, it is merely
necessary to form a column of values of fx and total it (see column 3 of Table
2.3). Here $m = \frac{1308}{400} = 3.27$.


Table 2.3

    x       f      fx    fx^2     fx^3
    1      73      73      73       73
    2      83     166     332      664
    3      80     240     720     2160
    4      57     228     912     3648
    5      41     205    1025     5125
    6      66     396    2376    14256
Total     400    1308    5438    25926

The fourth and fifth columns of the table give the second and third moments
respectively. Here $m'_2 = \frac{5438}{400} = 13.595$, and $m'_3 = \frac{25926}{400} = 64.815$.
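
The column totals of Table 2.3 and the resulting moments can be reproduced as follows. This is a minimal sketch of definition (2.7.1); the function name `raw_moment` is hypothetical.

```python
x = [1, 2, 3, 4, 5, 6]          # values of the variate (Table 2.3)
f = [73, 83, 80, 57, 41, 66]    # observed frequencies


def raw_moment(r, values, freqs):
    """rth moment about zero, m'_r = (1/N) * sum of f_i * x_i**r."""
    n = sum(freqs)
    return sum(fi * xi ** r for xi, fi in zip(values, freqs)) / n


m1 = raw_moment(1, x, f)  # arithmetic mean m
m2_raw = raw_moment(2, x, f)
m3_raw = raw_moment(3, x, f)
print(round(m1, 3), round(m2_raw, 3), round(m3_raw, 3))  # 3.27 13.595 64.815
```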

In  dealing  with  a  grouped  distribution  such  as  that  of  Table  2.2,  the  moments 
are  calculated  on  the  assumption  that  all  the  individuals  in  a  class  can  be  re- 
garded as  having  the  central  value  (or  class-mark)  of  that  class.  Although  this 
is  not  actually  true,  the  errors  caused  by  grouping  are  not  usually  serious  unless 
the  grouping  is  very  coarse.  (For  a  method  of  correction,  see  §  5.10.) 

The numerical labor of the calculation may generally be substantially reduced
by suitable coding. If one of the class-marks ($x_0$) is chosen near the center of the
table, and a new variable, u, is defined by

(2.7.2)  $u = \frac{x - x_0}{c}$

where c is the class-interval, the values of u will as a rule be much simpler to
work with than the original x values. As an illustration, Table 2.4 shows the
procedure for calculating the first three moments about zero from the data of
Table 2.2, where $x_0$ has been taken as 45.5 and c as 4. The values of $m'_r$ (r = 0, 1,
2, 3) are given in the last row of the table. The last column is a check column
(suggested by Charlier). The check depends on the identity:

(2.7.3)  $\sum f_i (u_i + 1)^3 = \sum f_i u_i^3 + 3 \sum f_i u_i^2 + 3 \sum f_i u_i + \sum f_i$

Table 2.4

    x      u       f      fu    fu^2    fu^3   f(u+1)^3
 29.5     -4       1      -4      16     -64        -27
 33.5     -3      14     -42     126    -378       -112
 37.5     -2      56    -112     224    -448        -56
 41.5     -1     172    -172     172    -172          0
 45.5      0     245       0       0       0        245
 49.5      1     263     263     263     263       2104
 53.5      2     156     312     624    1248       4212
 57.5      3      67     201     603    1809       4288
 61.5      4      23      92     368    1472       2875
 65.5      5       3      15      75     375        648
Total           1000     553    2471    4105      14177
m'_r               1   0.553   2.471   4.105

Thus  14,177  =  4105  +  3(2471)  +  3(553)  +  1000. 

The values of $f(u + 1)^3$ are found by multiplying each f by the cube of the u
in the next line (since the values of u increase by 1 as we go down the column).

The arithmetic mean of the original variate X is found by multiplying the
moment $m'_1$ (in terms of u) by c and adding $x_0$. Here

m  =  4(0.553)  +  45.5  =  47.712  lb. 
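
The coded calculation of Table 2.4, including the Charlier check, may be sketched as follows. The variable names are hypothetical; the class-marks, frequencies, $x_0$ and c are those of the table.

```python
class_marks = [29.5, 33.5, 37.5, 41.5, 45.5, 49.5, 53.5, 57.5, 61.5, 65.5]
freqs = [1, 14, 56, 172, 245, 263, 156, 67, 23, 3]
x0, c = 45.5, 4.0  # chosen origin and class-interval

# Coded variable u = (x - x0) / c, which takes the integer values -4, ..., 5.
u = [round((x - x0) / c) for x in class_marks]

# Column totals: sums[r] is the sum of f * u**r for r = 0, 1, 2, 3.
sums = [sum(fi * ui ** r for ui, fi in zip(u, freqs)) for r in range(4)]
check = sum(fi * (ui + 1) ** 3 for ui, fi in zip(u, freqs))

# Charlier check, identity (2.7.3).
assert check == sums[3] + 3 * sums[2] + 3 * sums[1] + sums[0]

coded_moments = [s / sums[0] for s in sums]  # m'_r in terms of u
mean_lb = c * coded_moments[1] + x0          # decode the mean to pounds
print(sums, check)        # [1000, 553, 2471, 4105] 14177
print(round(mean_lb, 3))  # 47.712
```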


2.8  Moments about the Mean. Variance, skewness and kurtosis  The rth
moment about the mean of the variate X is defined as:

(2.8.1)  $m_r = N^{-1} \sum_i f_i (x_i - m)^r$

As before we see that $m_0 = 1$. Also, putting r = 1, we have

(2.8.2)  $m_1 = N^{-1} \sum_i f_i x_i - m = 0$

for every sample, so that $m_1$ does not tell us anything about the sample. However,
$m_2$, $m_3$, $m_4$, ... are statistics which are often used for expressing the
characteristics of a sample and making inferences about the population.


Historically they have been of great importance in the development of mathematical
statistics, but the related k-statistics (see § 5.4) are now recognized to be
more convenient.

The second moment, $m_2$, is often called the variance. It is given by

(2.8.3)  $m_2 = N^{-1} \sum_i f_i (x_i - m)^2$
             $= N^{-1} \left( \sum_i f_i x_i^2 - 2m \sum_i f_i x_i + m^2 \sum_i f_i \right)$
             $= m'_2 - m^2$

since $\sum_i f_i x_i = Nm$, and $\sum_i f_i = N$.

The  positive  square  root  of  the  variance  is  called  the  standard  deviation, 
denoted  usually  by  s.  It  is  the  most  widely-used  measure  of  the  spread  of  a  sample 
distribution.  Many  authors,  however,  prefer  to  define  the  variance  as  the 
second k-statistic, $k_2$, which is related to $m_2$ by means of the equation

(2.8.4)  $k_2 = \frac{N m_2}{N - 1} = (N - 1)^{-1} \sum_i f_i (x_i - m)^2$

This  definition  has  some  advantages  from  the  point  of  view  of  statistical 
inference,  and  we  shall  adopt  it  in  this  book.  Of  course,  in  large  samples  there  is 
little  difference  between  k2  and  m2. 
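
For the die data of Table 2.3 the two definitions of the variance can be compared directly. This is a sketch only; the variable names are hypothetical.

```python
from math import sqrt

x = [1, 2, 3, 4, 5, 6]          # Table 2.3
f = [73, 83, 80, 57, 41, 66]
N = sum(f)

m = sum(fi * xi for xi, fi in zip(x, f)) / N
sum_sq_dev = sum(fi * (xi - m) ** 2 for xi, fi in zip(x, f))

m2 = sum_sq_dev / N        # second moment about the mean, (2.8.3)
k2 = sum_sq_dev / (N - 1)  # second k-statistic, (2.8.4)
s = sqrt(k2)               # standard deviation under the definition adopted here

# Shortcut from (2.8.3): m2 = m'_2 - m**2.
raw2 = sum(fi * xi ** 2 for xi, fi in zip(x, f)) / N
print(round(m2, 4), round(k2, 4))  # 2.9021 2.9094
print(round(raw2 - m ** 2, 4))     # 2.9021
print(s)
```

With N = 400 the two values differ only in the third decimal place, as remarked above for large samples.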

The  spread  of  a  distribution  is  most  naturally  and  easily  measured  by  the 
range of the variate, that is, by the difference $x_N - x_1$, where the measured values
$x_1, x_2, \dots, x_N$ are supposed to be arranged in increasing order of size. The
range,  however,  is  not  very  convenient  mathematically  and  is  apt  to  be  sensitive, 
in  large  samples,  to  sampling  fluctuations.  The  standard  deviation  makes  use  of 
all  the  information  in  the  sample,  and  is  generally  the  most  reliable  measure  of 
spread,  even  though  it  is  a  little  more  troublesome  to  calculate.  The  meaning  of 
the  standard  deviation  is  perhaps  most  easily  grasped  by  noting  that  in  a  good 
many  common  types  of  distribution,  which  are  more  or  less  symmetrical  about  a 
central  value  and  tail  off  in  both  directions,  roughly  two-thirds  of  all  the  variates 
(in  a  rather  large  sample)  will  lie  within  an  interval  of  x,  extending  for  one  stan- 
dard deviation  on  either  side  of  the  mean.  If  the  sample  consists  of  several 
hundred  individuals,  the  standard  deviation  will  usually  be  (roughly)  one-sixth  of 
the  range.  This  is  worth  remembering,  as  a  guard  against  gross  errors  (such  as 
misplacing  a  decimal  point)  in  the  calculation  of  the  standard  deviation.  Reasons 
for  these  statements  will  appear  when  the  normal  distribution  is  discussed  (see 
§§3.13  and  8.21). 

The third moment, $m_3$, and the fourth moment, $m_4$, depend on the shape of
the frequency polygon representing the distribution. Because of the cancelling
of positive and negative third powers, a symmetrical distribution will have
$m_3 = 0$. A distribution with a long tail extending out to the right will usually
have a positive $m_3$, while one with a long tail out to the left will usually have a
negative $m_3$. This is because the positive values of $(x - m)^3$ tend to outweigh
the negative values, or vice versa. The statistic $m_3$ may therefore be used as a
measure of the skewness of the distribution. It is not true, however, that if $m_3 = 0$
then the distribution is necessarily symmetrical [3]. To have a measure independent
of the units of x, it is customary to divide $m_3$ by $(m_2)^{3/2}$. The skewness
is then a pure number.

The fourth moment, $m_4$, divided by $m_2^2$ is a measure of kurtosis. This word,
which means "peakedness," was adopted because in some common types of
distribution a high value of $m_4$ is associated with a high central peak in the
frequency polygon. However, the value of $m_4$ is very much dependent on the
shape of the tails [4] and may have little to do with any central peak.
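
As an illustration, the moment-based measures of skewness and kurtosis can be computed for the die data of Table 2.3. The sketch below follows definition (2.8.1); the function name `central_moment` is hypothetical.

```python
x = [1, 2, 3, 4, 5, 6]          # Table 2.3
f = [73, 83, 80, 57, 41, 66]


def central_moment(r, values, freqs):
    """rth moment about the mean, m_r = (1/N) * sum of f_i * (x_i - m)**r."""
    n = sum(freqs)
    mean = sum(fi * xi for xi, fi in zip(values, freqs)) / n
    return sum(fi * (xi - mean) ** r for xi, fi in zip(values, freqs)) / n


m2 = central_moment(2, x, f)
m3 = central_moment(3, x, f)
m4 = central_moment(4, x, f)

skewness = m3 / m2 ** 1.5  # pure number, independent of the units of x
kurtosis = m4 / m2 ** 2
print(round(m3, 4))        # 1.3796
print(skewness, kurtosis)
```

The positive $m_3$ reflects the slight excess of low faces in this sample, which leaves the longer tail on the right of the mean 3.27.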

As we shall see later, the k-statistics $k_3$ and $k_4$ are more convenient than $m_3$
and $m_4$ for measuring skewness and kurtosis respectively. We are usually
interested in estimating these characteristics for a population, and the k-statistics
give better estimates than the moments.

2.9  Moments  for  a  Probability  Distribution  In  many  problems  of  statistical 
inference,  samples  are  drawn  from  a  population  which  is  assumed  to  be  described 
by  a  known  type  of  mathematical  distribution  function.  There  may  be  one  or 
more  parameters  occurring  in  this  function;  the  values  of  these  parameters 
are  not  known  but  can  be  estimated  from  the  samples.  One  of  the  more  common 
methods  of  estimation  makes  use  of  the  relation  between  the  moments  of  a  sample 
and  the  moments  of  the  population.  For  the  present,  the  population  will  be 
thought  of  as  infinitely  large  and  characterized  by  its  distribution  function  F(x). 
The  random  variable  X  may  be  discrete  or  continuous,  and  in  the  latter  case  a 
density function f(x) will exist (see § 1.16). Since F(x) is the probability that
$X \le x$, the distribution of X in the population is often called a probability
distribution. 

If X is discrete, and if the probability is $p_j$ that it takes the value $x_j$, the
expectation of $X^r$ is defined as

(2.9.1)  $\mu'_r = E(X^r) = \sum_j x_j^r p_j$

This is called the rth moment of X about zero. If there are infinitely many possible
values of j, it is assumed that the sum converges.

The first moment, $\mu'_1$, is the expectation of X and is often called the population
mean. It will be denoted by $\mu$, without prime or subscript, and corresponds to the
sample statistic $m'_1$ (= m) previously defined in § 2.7. We shall adopt as far as
possible the very useful convention of distinguishing between sample and
population by the use of Latin letters such as m for the former and Greek letters
such as $\mu$ for the latter.

If X is continuous, the definition corresponding to equation (1) is

(2.9.2)  $\mu'_r = E(X^r) = \int_{-\infty}^{\infty} x^r f(x)\,dx$

provided that the integral exists.

The expectation of $(X - \mu)^r$ is called the rth moment about the mean, and is
given by

(2.9.3)  $\mu_r = E(X - \mu)^r = \sum_j (x_j - \mu)^r p_j$

when X is discrete, or by

(2.9.4)  $\mu_r = E(X - \mu)^r = \int_{-\infty}^{\infty} (x - \mu)^r f(x)\,dx$

when X is continuous. Since $\sum_j p_j = 1$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$, it is obvious that
$\mu_0 = 1$ in all cases. Also,

(2.9.5)  $\mu_1 = E(X - \mu) = E(X) - \mu = 0$

for any kind of distribution that possesses a mean. The lowest useful value of r
is 2, and $\mu_2$ is a very important descriptive parameter for the population, known
as the population variance. The square root of $\mu_2$ is usually denoted by $\sigma$ and is
called the population standard deviation. It is a measure of the spread or dispersion
of the population about its mean. The sample standard deviation s defined in
§ 2.8 provides an estimate of $\sigma$.

Using the binomial theorem, we may write

(2.9.6)  $(x - \mu)^r = \sum_{q=0}^{r} (-1)^q \binom{r}{q} \mu^q x^{r-q}$

This allows us to express the moments $\mu_r$ in terms of the moments $\mu'_r$ (which are
usually easier to calculate). We have, assuming that moments up to the rth exist,

(2.9.7)  $\mu_r = E(X - \mu)^r = \sum_{q=0}^{r} (-1)^q \binom{r}{q} \mu^q E(X^{r-q})$

Thus, for example,

(2.9.8)  $\mu_2 = 1 \cdot \mu'_2 - 2\mu \mu'_1 + 1 \cdot \mu^2 \mu'_0$
             $= \mu'_2 - \mu^2$
(note that $\mu'_1 = \mu$ and $\mu'_0 = 1$)

(2.9.9)  $\mu_3 = \mu'_3 - 3\mu'_2 \mu + 3\mu'_1 \mu^2 - \mu'_0 \mu^3$
             $= \mu'_3 - 3\mu'_2 \mu + 2\mu^3$

(2.9.10)  $\mu_4 = \mu'_4 - 4\mu'_3 \mu + 6\mu'_2 \mu^2 - 4\mu'_1 \mu^3 + \mu'_0 \mu^4$
              $= \mu'_4 - 4\mu'_3 \mu + 6\mu'_2 \mu^2 - 3\mu^4$

The quantities $\mu_3/\sigma^3$ and $\mu_4/\sigma^4$ are measures of skewness and kurtosis,
respectively, for the population.

Example 1  Assuming that a well-made die has a probability $\frac{1}{6}$ of falling
with any specified face uppermost, the expectation of the number of spots on
the upper face is

(2.9.11)  $\mu = \sum_{j=1}^{6} \frac{1}{6} \cdot j = \frac{1}{6} \cdot 21 = 3.5$

The variance of this number is

(2.9.12)  $\mu_2 = \sum_{j=1}^{6} \frac{1}{6} \cdot j^2 - \mu^2 = \frac{1}{6} \cdot 91 - (3.5)^2 = 2.917$

so that $\sigma = 1.71$. Note that the sum of the integers from 1 to n is given by
$\frac{1}{2} n(n + 1)$ and the sum of the squares of the integers from 1 to n is
$\frac{1}{6} n(n + 1)(2n + 1)$.
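
Relation (2.9.7) between the two kinds of moment can be checked on the die of Example 1. The sketch below uses exact fractions; the function name `central_from_raw` is hypothetical.

```python
from fractions import Fraction
from math import comb


def central_from_raw(raw):
    """Moments about the mean from moments about zero, using (2.9.7).

    raw[q] is the qth moment about zero, with raw[0] = 1 and raw[1] the mean.
    """
    mu = raw[1]
    return [
        sum((-1) ** q * comb(r, q) * mu ** q * raw[r - q] for q in range(r + 1))
        for r in range(len(raw))
    ]


# Moments about zero of a well-made die: the sum of j**r over the faces, over 6.
raw = [sum(Fraction(j ** r, 6) for j in range(1, 7)) for r in range(5)]
central = central_from_raw(raw)
print(raw[1], raw[2])          # 7/2 91/6
print(central[2], central[3])  # 35/12 0
```

The value 35/12 is the variance 2.917 of (2.9.12), and the vanishing third moment reflects the symmetry of the distribution about 3.5.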

Example 2  Suppose that a continuous variate X may take values between
0 and 2, with a density function

$f(x) = x$,  $0 \le x \le 1$
$f(x) = 2 - x$,  $1 \le x \le 2$

The graph is triangular, with a vertex of height 1 at x = 1.

The expectation of X is obviously 1 (from considerations of symmetry) but
may be calculated from the relation

$\mu = \int_0^2 x f(x)\,dx = \int_0^1 x^2\,dx + \int_1^2 x(2 - x)\,dx = \frac{1}{3} + \frac{2}{3} = 1$

The variance of X is $\mu_2 = \mu'_2 - \mu^2$, where $\mu'_2 = \int_0^2 x^2 f(x)\,dx = \frac{7}{6}$. Therefore,
$\mu_2 = \frac{1}{6}$ and $\sigma = 0.408$.
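
The integrals of Example 2 can be checked numerically. The sketch below uses a simple midpoint rule with an assumed 4000 subintervals; the function names `density` and `integrate` are hypothetical.

```python
from math import sqrt


def density(x):
    """Triangular density of Example 2."""
    if 0.0 <= x <= 1.0:
        return x
    if 1.0 < x <= 2.0:
        return 2.0 - x
    return 0.0


def integrate(func, a, b, steps=4000):
    """Midpoint-rule approximation to the integral of func from a to b."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))


total = integrate(density, 0.0, 2.0)                      # should be 1
mean = integrate(lambda x: x * density(x), 0.0, 2.0)      # first moment
raw2 = integrate(lambda x: x * x * density(x), 0.0, 2.0)  # second moment about zero
variance = raw2 - mean ** 2
sigma = sqrt(variance)
print(round(mean, 3), round(raw2, 4), round(variance, 4), round(sigma, 3))
# 1.0 1.1667 0.1667 0.408
```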

2.10  Generating Functions  It is often convenient mathematically to consider
functions which serve to "generate" moments or other characteristics of a
population. When such a function of a real variable (say h) is expanded in powers
of h, the coefficients of h, $h^2/2!$, $h^3/3!$, ... form the set of moments or other
quantities. It is in this sense that these quantities may be thought of as generated
by the function.

For a discrete variate the moment generating function (m.g.f.) is defined by

(2.10.1)  $M(h) = E(e^{hX}) = \sum_j e^{h x_j} p(x_j)$
              $= \sum_j \left( 1 + h x_j + \frac{h^2}{2!} x_j^2 + \dots \right) p(x_j)$
              $= 1 + h \mu'_1 + \frac{h^2}{2!} \mu'_2 + \dots = \sum_{r=0}^{\infty} \mu'_r \frac{h^r}{r!}$

where it is assumed that the indicated sum converges.
For a continuous variate the m.g.f. is defined by

(2.10.2)  $M(h) = \int_{-\infty}^{\infty} e^{hx} f(x)\,dx$

provided that the integral converges, and, if so, this reduces to the same series as
in equation (1) above.

Example 3  For a symmetrical die, the values of $x_j$ are 1, 2, 3, 4, 5, 6, and
$p(x_j)$ is $\frac{1}{6}$ for each of them. Therefore,

$M(h) = \frac{1}{6}(e^h + e^{2h} + \dots + e^{6h})$
      $= \frac{1}{6} e^h (e^{6h} - 1)(e^h - 1)^{-1}$

Written out in powers of h, this function becomes

$M(h) = \frac{1}{6} \left( 6 + S_1 h + S_2 \frac{h^2}{2!} + S_3 \frac{h^3}{3!} + \dots \right)$

where $S_1 = 1 + 2 + \dots + 6 = 21$, $S_2 = 1^2 + 2^2 + \dots + 6^2 = 91$,
$S_3 = 1^3 + 2^3 + \dots + 6^3 = 441$, etc.
Therefore, $\mu'_1 = S_1/6 = 7/2$, $\mu'_2 = S_2/6 = 91/6$, etc.
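
The statement that M(h) generates the moments can be checked numerically for the die of Example 3: the first and second derivatives of M(h) at h = 0 should equal 7/2 and 91/6. The sketch below approximates the derivatives by central differences with an assumed step of 0.001; the function names are hypothetical.

```python
from math import exp


def mgf_sum(h):
    """M(h) for a symmetrical die, written as the defining sum."""
    return sum(exp(j * h) for j in range(1, 7)) / 6


def mgf_closed(h):
    """Closed form of M(h); not defined at h = 0."""
    return exp(h) * (exp(6 * h) - 1) / (6 * (exp(h) - 1))


step = 1e-3  # assumed step for the finite differences
first = (mgf_sum(step) - mgf_sum(-step)) / (2 * step)
second = (mgf_sum(step) - 2 * mgf_sum(0.0) + mgf_sum(-step)) / step ** 2
print(round(first, 3), round(second, 3))  # 3.5 15.167
print(mgf_sum(0.3), mgf_closed(0.3))      # the two forms agree
```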

Example 4  For the continuous rectangular distribution specified by

$f(x) = \frac{1}{2a}$,  $-a \le x \le a$
$f(x) = 0$,  $x < -a$ or $x > a$

we obtain

(2.10.3)  $M(h) = \frac{1}{2a} \int_{-a}^{a} e^{hx}\,dx = \frac{1}{2ah} (e^{ah} - e^{-ah}) = \frac{1}{ah} \sinh ah$.

Expressed as a power series,

$M(h) = 1 + \frac{a^2 h^2}{3!} + \frac{a^4 h^4}{5!} + \dots$

so that the odd moments are all zero and the even moments are given by

$\mu_2 = \frac{a^2}{3}$,  $\mu_4 = \frac{a^4}{5}$,  etc.

These are the coefficients of $h^2/2!$, $h^4/4!$, etc. in the above series. Since the mean
of the distribution is zero, the moments $\mu_r$ are the same as the moments $\mu'_r$.
A fuller discussion of moment generating functions may be found in [5].

*  2.11  Factorial Moments  For a variate X which is discrete and takes values
spaced at unit intervals it is sometimes convenient to use factorial moments,
defined by

(2.11.1)  $\mu'_{(r)} = \sum_j (x_j)_r\, p(x_j)$

where $(x_j)_r = x_j (x_j - 1)(x_j - 2) \dots (x_j - r + 1)$.

For the die of Example 1 in § 2.9 we have $\mu'_{(1)} = 21/6$, $\mu'_{(2)} = 70/6$,
$\mu'_{(3)} = 210/6$, etc. The highest non-zero moment is $\mu'_{(6)} = 5!$.

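The factorial moment generating function (f.m.g.f.) is given by

(2.11.2)  $G(h) = \sum_j (1 + h)^{x_j} p(x_j)$

For the die just mentioned this becomes

$G(h) = \frac{1}{6} \sum_{j=1}^{6} (1 + h)^j$
      $= 1 + \frac{7}{2} h + \frac{35}{6} h^2 + \frac{35}{6} h^3 + \frac{7}{2} h^4 + \frac{7}{6} h^5 + \frac{1}{6} h^6$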
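
The factorial moments of the die, and the coefficients of G(h), can be verified with exact fractions. In the sketch below the helper name `falling` is hypothetical; the coefficient of $h^r$ in G(h) should equal the rth factorial moment divided by r!.

```python
from fractions import Fraction
from math import comb, factorial

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face


def falling(x, r):
    """x(x - 1)...(x - r + 1); the empty product (r = 0) is 1."""
    out = 1
    for i in range(r):
        out *= x - i
    return out


# Factorial moments for r = 0, 1, ..., 6, from definition (2.11.1).
fact_moments = [sum(falling(x, r) * p for x in faces) for r in range(7)]

# Coefficient of h**r in G(h) = (1/6) * sum of (1 + h)**j, by the binomial theorem.
coeffs = [sum(comb(x, r) * p for x in faces) for r in range(7)]

print(fact_moments[1], fact_moments[2], fact_moments[3], fact_moments[6])
# 7/2 35/3 35 120
print([str(c) for c in coeffs])
# ['1', '7/2', '35/6', '35/6', '7/2', '7/6', '1/6']
```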

2.12  Cumulants  If the logarithm (to base e) of the m.g.f. can be expanded as
a series of powers of h (which converges in some interval including h = 0) in
the form

(2.12.1)  $K(h) = \log_e M(h) = \kappa_1 h + \kappa_2 \frac{h^2}{2!} + \kappa_3 \frac{h^3}{3!} + \dots = \sum_{r=1}^{\infty} \kappa_r \frac{h^r}{r!}$

then the coefficients $\kappa_r$ of $h^r/r!$ are called the cumulants of the distribution ($\kappa$ is the
Greek letter kappa) and K(h) is called the cumulant generating function (c.g.f.).
The cumulants play an important role in sampling theory, as was first pointed
out by Sir R. A. Fisher, who emphasized their advantages over moments. As
will be seen shortly, the first cumulant is the same as $\mu'_1$ (the population mean)
and the second and third are the same as $\mu_2$ and $\mu_3$ respectively. The higher
cumulants differ, however, from the corresponding moments.

If the origin is taken at the population mean (so that $\mu'_1 = 0$ and $\mu'_r$ is
therefore the same as $\mu_r$ for all r), we have

(2.12.2)  $M(h) = 1 + \sum_{r=2}^{\infty} \mu_r \frac{h^r}{r!}$

Also, by the definition of K(h),

(2.12.3)  $\frac{dK(h)}{dh} = \frac{1}{M(h)} \frac{dM(h)}{dh} = \kappa_1 + \kappa_2 h + \kappa_3 \frac{h^2}{2!} + \kappa_4 \frac{h^3}{3!} + \dots$

From the definition of M(h) in Eq. (2),

(2.12.4)  $\frac{dM(h)}{dh} = \mu_2 h + \mu_3 \frac{h^2}{2!} + \mu_4 \frac{h^3}{3!} + \dots$

so that, from Eqs. (2), (3) and (4),

$\left( 1 + \mu_2 \frac{h^2}{2!} + \mu_3 \frac{h^3}{3!} + \dots \right) \left( \kappa_1 + \kappa_2 h + \kappa_3 \frac{h^2}{2!} + \kappa_4 \frac{h^3}{3!} + \dots \right) = \mu_2 h + \mu_3 \frac{h^2}{2!} + \mu_4 \frac{h^3}{3!} + \dots$

By equating coefficients of corresponding powers of h on the two sides of this
equation, we find

(2.12.5)  $\kappa_1 = 0$,  $\kappa_2 = \mu_2$,  $\kappa_3 = \mu_3$,  $\kappa_4 = \mu_4 - 3\mu_2^2$, ...

The most common measure of kurtosis for a population is $\kappa_4/\kappa_2^2 = (\mu_4/\mu_2^2) - 3$.
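
Relations (2.12.5) can be applied to the die of Example 1 in § 2.9. The sketch below computes the moments about the mean exactly and then the cumulants from the second onward; the names are hypothetical.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)                 # probability of each face
mu = sum(x * p for x in faces)     # population mean, 7/2


def central(r):
    """rth moment about the mean for the die."""
    return sum((x - mu) ** r * p for x in faces)


kappa2 = central(2)
kappa3 = central(3)
kappa4 = central(4) - 3 * central(2) ** 2  # fourth cumulant from (2.12.5)
kurtosis_measure = kappa4 / kappa2 ** 2    # equals mu_4 / mu_2**2 - 3
print(kappa2, kappa3, kappa4, kurtosis_measure)  # 35/12 0 -259/24 -222/175
```

The negative value of the kurtosis measure reflects the flat shape of this distribution, which has no central peak at all.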
If the origin from which x is measured is changed from x = 0 to x = a, any
given value x is changed to x - a. This will make no difference to any of the
moments about the mean (since the mean will also be changed to $\mu - a$), and
therefore will not change any of the cumulants from $\kappa_2$ on. The first cumulant,
however, becomes $\kappa_1 - a$. In the above derivation a was taken as the population
mean $\mu$, so that before the shift we should have

(2.12.6)  $\kappa_1 = \mu$.

If the scale of measurement is altered so that a value previously recorded as
x now becomes bx, the effect on M(h) is to replace h by bh. The rth moment
(whether about the origin or the mean) and the rth cumulant are multiplied by $b^r$.
For the f.m.g.f., a change of origin has the effect of multiplying G(h) by $(1 + h)^{-a}$
and a change of scale replaces 1 + h by $(1 + h)^b$.

The most important property of moment generating functions is that if
$X_1, X_2, \dots, X_N$ are independent variates, and if L is a linear function of the X's
given by $L = c_1 X_1 + c_2 X_2 + \dots + c_N X_N$ (the c's being arbitrary constants,
not all zero), then the m.g.f. of L is

(2.12.7)  $M(h) = \prod_{j=1}^{N} M_j(c_j h)$

i.e., it is the product of the m.g.f.'s for the separate variates, with $c_j h$ substituted
for h. From the definition of K(h) it follows that

(2.12.8)  $K(h) = \sum_{j=1}^{N} K_j(c_j h)$

so that the separate c.g.f.'s are added. To find the c.g.f. for a sum of independent
variates all we have to do is to add the c.g.f.'s for the variates taken separately.
This is the principal reason for introducing cumulants.
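
Equation (2.12.8) implies that the rth cumulant of L is the sum of $c_j^r$ times the rth cumulant of $X_j$. The sketch below checks this for an assumed example that is not in the text, $L = X_1 + 2X_2$ with $X_1$ and $X_2$ two independent dice, by enumerating the 36 outcomes; the function name `cumulants` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def cumulants(dist):
    """First four cumulants of a discrete distribution {value: probability}."""
    mean = sum(x * p for x, p in dist.items())

    def central(r):
        return sum((x - mean) ** r * p for x, p in dist.items())

    mu2, mu3, mu4 = central(2), central(3), central(4)
    return [mean, mu2, mu3, mu4 - 3 * mu2 ** 2]


die = {x: Fraction(1, 6) for x in range(1, 7)}
c1, c2 = 1, 2  # assumed coefficients of the linear function L

# Distribution of L = c1*X1 + c2*X2 for two independent dice.
combo = {}
for (x1, p1), (x2, p2) in product(die.items(), die.items()):
    value = c1 * x1 + c2 * x2
    combo[value] = combo.get(value, 0) + p1 * p2

k_die = cumulants(die)
k_combo = cumulants(combo)
# Prediction from (2.12.8): the rth cumulant is multiplied by c_j**r and added.
predicted = [(c1 ** r + c2 ** r) * k for r, k in enumerate(k_die, start=1)]
print(k_combo == predicted)  # True
```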

In § 2.8 the k-statistics were briefly mentioned. These are simply related to
the cumulants of the population. In fact, the expectation of the rth k-statistic $k_r$
is the corresponding cumulant $\kappa_r$. The $k_r$ are therefore often used as estimates
of the cumulants $\kappa_r$ (at any rate for r < 4). See § 5.8.

* 2.13 Characteristic Functions  For some distributions the moment generating
function does not exist. If, however, we replace h by ih ($i = \sqrt{-1}$) in the
definition (2.10.1) and define the characteristic function by

(2.13.1)  $C(h) = \sum_j e^{ihx_j} p(x_j)$

or

(2.13.2)  $C(h) = \int_{-\infty}^{\infty} e^{ihx} f(x)\,dx$

then C(h) always exists, as a complex number, for any distribution for which the
$p(x_j)$, or f(x), are defined. It may be written as a series:

(2.13.3)  $C(h) = 1 + ih\mu'_1 - \frac{h^2}{2!}\mu'_2 - \frac{ih^3}{3!}\mu'_3 + \frac{h^4}{4!}\mu'_4 + \cdots$

and so may be regarded as generating moments in the same sort of way as M(h).
It may be noted that corresponding to (2) there is a reciprocal relation

(2.13.4)  $2\pi f(x) = \int_{-\infty}^{\infty} e^{-ihx} C(h)\,dh$

The density function f(x) is said to be the Fourier transform of C(h). Tables of
the Fourier transform (such as those in [6]) may be useful in finding C(h), given
f(x), or vice versa.
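
As an illustration of how C(h) generates moments, the sketch below evaluates (2.13.1) for a fair die (an assumed example, not one from the text) and recovers the first two moments about the origin from numerical derivatives at h = 0. The step sizes are arbitrary choices for the illustration.

```python
import cmath

# Illustrative distribution (an assumption of this sketch): a fair die.
die = {x: 1 / 6 for x in range(1, 7)}


def char_fn(dist, h):
    """C(h) = sum_j exp(i h x_j) p(x_j), Eq. (2.13.1)."""
    return sum(p * cmath.exp(1j * h * x) for x, p in dist.items())


def raw_moment(dist, r):
    """r-th moment about the origin, computed directly from the probabilities."""
    return sum(p * x**r for x, p in dist.items())


# From the series (2.13.3): C'(0) = i * mu'_1 and C''(0) = -mu'_2.
# Both derivatives are approximated here by central differences.
d1 = 1e-5
first_derivative = (char_fn(die, d1) - char_fn(die, -d1)) / (2 * d1)
d2 = 1e-3
second_derivative = (
    char_fn(die, d2) - 2 * char_fn(die, 0.0) + char_fn(die, -d2)
) / d2**2

mu1_from_cf = (first_derivative / 1j).real
mu2_from_cf = (-second_derivative).real
```

Since $|e^{ihx}| = 1$, the modulus of C(h) can never exceed 1, which is why the sum or integral always exists.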

2.14 Bienaymé's Theorem  If $X_1, X_2, \ldots, X_N$ are pairwise independent
variates (see § 1.10), and if L is a linear combination given by $L = \sum_j c_j X_j$,
then the variance of L is

(2.14.1)  $V(L) = \sum_j c_j^2 V(X_j)$

If, for example, all the $c_j$ are equal to 1, the theorem states that the variance of a
sum of pairwise independent variates is equal to the sum of their variances.

To prove this, let $E(X_j)$ be $\mu_{(j)}$, where the j is placed in parentheses to show that
$\mu_{(j)}$ is not a jth moment but the first moment of the jth variate. By Theorem 1.16,

$E(L) = \sum_j c_j \mu_{(j)}$

From the definition of variance,

(2.14.2)  $V(L) = E[L - E(L)]^2 = E\left[\sum_j c_j (X_j - \mu_{(j)})\right]^2$
          $= E\sum_j c_j^2 (X_j - \mu_{(j)})^2 + E\sum_{j \neq k} c_j c_k (X_j - \mu_{(j)})(X_k - \mu_{(k)})$

(Note that in forming the square of a sum of N quantities there are N terms in
which each quantity is squared, and N(N - 1) terms in which each quantity is
multiplied by a different one. These latter are the double sum above for which
$j \neq k$.)

Now $V(X_j) = E(X_j - \mu_{(j)})^2$. Also we may define the covariance of two
distinct variates, $X_j$ and $X_k$, by

(2.14.3)  $C(X_j, X_k) = E[(X_j - \mu_{(j)})(X_k - \mu_{(k)})]$

Equation (2) then states that

(2.14.4)  $V(L) = \sum_j c_j^2 V(X_j) + \sum_{j \neq k} c_j c_k C(X_j, X_k)$

If $X_j$ and $X_k$ are independent, it follows from Theorem 1.17 that

$E[(X_j - \mu_{(j)})(X_k - \mu_{(k)})] = E(X_j - \mu_{(j)}) \cdot E(X_k - \mu_{(k)}) = 0$

since $E(X_j) = \mu_{(j)}$ by definition. Equation (4) therefore reduces to Eq. (1).

Variates which are such that their covariance is zero are said to be uncorrelated.
It is sufficient for this theorem that the variates should be pairwise
uncorrelated, and they need not be independent in the full sense. (See § 1.10.)
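
Bienaymé's theorem can be verified by exact enumeration. The sketch below, which assumes two independent throws of a fair die and the arbitrary constants 2 and -3, builds the distribution of L directly and compares its variance with the right-hand side of (2.14.1). The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Illustrative variate (an assumption of this sketch): one throw of a fair die.
die = {x: Fraction(1, 6) for x in range(1, 7)}


def mean(dist):
    return sum(p * x for x, p in dist.items())


def variance(dist):
    m = mean(dist)
    return sum(p * (x - m) ** 2 for x, p in dist.items())


def linear_combination(dists, coeffs):
    """Exact distribution of L = sum c_j X_j for independent variates."""
    result = {}
    for combo in product(*(d.items() for d in dists)):
        prob, value = Fraction(1), Fraction(0)
        for c, (x, p) in zip(coeffs, combo):
            prob *= p
            value += c * x
        result[value] = result.get(value, Fraction(0)) + prob
    return result


coeffs = [Fraction(2), Fraction(-3)]
v_direct = variance(linear_combination([die, die], coeffs))
v_theorem = sum(c**2 * variance(die) for c in coeffs)   # Eq. (2.14.1)
v_sum_two_dice = variance(linear_combination([die, die], [1, 1]))
```

With exact fractions the two variances agree identically, and the sum of two dice has variance 35/6, twice the variance 35/12 of a single die.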

The Pearson coefficient of correlation between two variates $X_j$ and $X_k$ is
defined by

(2.14.5)  $\rho_{jk} = \dfrac{C(X_j, X_k)}{[V(X_j) \cdot V(X_k)]^{1/2}}$

It is a pure number with range from -1 to +1 inclusive, and is zero when $X_j$ and
$X_k$ are uncorrelated. If we write $V(X_j) = \sigma_j^2$, and $C(X_j, X_k) = \rho_{jk}\sigma_j\sigma_k$, equation
(4) above becomes

(2.14.6)  $V(L) = \sum_j c_j^2 \sigma_j^2 + \sum_{j \neq k} c_j c_k \rho_{jk} \sigma_j \sigma_k$

From the definition it is obvious that $\rho_{jk} = \rho_{kj}$.

The coefficient of correlation is important in problems involving two or
more variates, and its properties will be discussed more fully in chapter 11. The
method of calculating a sample statistic r for estimating $\rho$ is given in § 11.14.

2.15 Markov's Inequality  This states that if X is a non-negative variate with
expectation $\mu$, and if $\lambda$ is any real positive number,

(2.15.1)  $P(X \geq \lambda) \leq \mu/\lambda$.

To prove this, we note that the set of all possible values of X can be divided
into two sub-sets $X_1$ and $X_2$, where $X_1$ contains all values $\geq \lambda$ and $X_2$ all values
$< \lambda$. By definition,

(2.15.2)  $\mu = E(X) = \int_0^{\infty} x\,p(x)\,dx = \int_0^{\lambda} x\,p(x)\,dx + \int_{\lambda}^{\infty} x\,p(x)\,dx \geq \int_{\lambda}^{\infty} x\,p(x)\,dx$

since p(x) is never negative.

But since in this last integral x is not less than $\lambda$,

(2.15.3)  $\int_{\lambda}^{\infty} x\,p(x)\,dx \geq \lambda \int_{\lambda}^{\infty} p(x)\,dx = \lambda P(X \geq \lambda)$.

From (2) and (3),

$\mu \geq \lambda P(X \geq \lambda)$

which is equivalent to (1).
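
A small numerical check shows how loose the bound can be. The sketch below assumes a single throw of a fair die and $\lambda = 5$; the function names are introduced only for this illustration.

```python
from fractions import Fraction

# Illustrative non-negative variate (an assumption of this sketch): a fair die.
die = {x: Fraction(1, 6) for x in range(1, 7)}


def markov_bound(dist, lam):
    """Upper bound mu / lambda of Eq. (2.15.1); needs a non-negative variate."""
    if any(x < 0 for x in dist):
        raise ValueError("Markov's inequality requires a non-negative variate")
    mu = sum(p * x for x, p in dist.items())
    return mu / lam


def tail_probability(dist, lam):
    """Exact P(X >= lambda)."""
    return sum(p for x, p in dist.items() if x >= lam)


bound = markov_bound(die, 5)       # 3.5 / 5
exact = tail_probability(die, 5)   # faces 5 and 6
```

Here the bound is 7/10 while the exact probability is 1/3, so the inequality holds with plenty of room to spare.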

2.16 Chebyshev's Inequality (attributed also to Bienaymé)  This is a
deduction from Markov's inequality and states that for a variate X, possessing
first and second moments,

(2.16.1)  $P(|X - \mu| \geq \lambda) \leq \dfrac{\sigma^2}{\lambda^2}$

where $E(X) = \mu$, $V(X) = \sigma^2$.

If in (2.15.1) we substitute for X the non-negative quantity $(X - \mu)^2$, for
which the expectation is $\sigma^2$, and for $\lambda$ the quantity $\lambda^2$, we obtain

$P[(X - \mu)^2 \geq \lambda^2] \leq \dfrac{\sigma^2}{\lambda^2}$

But to say that $(X - \mu)^2 \geq \lambda^2$ is the same as to say that $|X - \mu| \geq \lambda$, whence the
theorem follows.
Example 5  If X is the sum of spots showing up on two good dice, it is
easily calculated that E(X) = 7 and V(X) = 35/6. (The two dice are supposed
to fall independently, and Bienaymé's theorem, with Theorem 1.16, gives the
result.)

The probability that $|X - 7| \geq 4$ is therefore $\leq 35/96$. The actual
probability is 1/6, so that the inequality is not very sharp. However, Chebyshev's
inequality is of great theoretical importance because of its wide applicability to
a variety of distributions.
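
The figures in Example 5 can be reproduced by enumerating the 36 equally likely outcomes. This is a minimal sketch; the variable names are chosen for the illustration only.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Distribution of the sum of two independent fair dice.
two_dice = {}
for a, b in product(faces, faces):
    two_dice[a + b] = two_dice.get(a + b, Fraction(0)) + Fraction(1, 36)

mu = sum(p * x for x, p in two_dice.items())
var = sum(p * (x - mu) ** 2 for x, p in two_dice.items())

lam = 4
chebyshev_bound = var / lam**2                                    # Eq. (2.16.1)
exact = sum(p for x, p in two_dice.items() if abs(x - mu) >= lam)  # sums 2, 3, 11, 12
```

The bound 35/96 is about 0.36, more than twice the exact value 1/6.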

Several  similar  inequalities,  usually  requiring  further  restrictions  on  the 
variate,  are  known.   See  [7]. 

2.17 The Joint Distribution of Two Variates  We have seen that the
expectation and the variance can be readily obtained for a sum of random variables, if
the separate expectations and variances, as well as the covariances, are known.
If the variates are independent, the moment generating function and cumulant
generating function for the sum are easily found from those for the individual
variates (§ 2.12). If, however, the distribution function itself is required, the
calculation is usually more difficult, even for independent variates.

Let us first suppose that X and Y are continuous variates. If f(x, y) dx dy
is the probability that at the same time X takes the value x (to x + dx) and Y the
value y (to y + dy), then f(x, y) is called the joint probability density for X and Y.
The density for X alone, regardless of Y, is

(2.17.1)  $g(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$

and the density for Y alone, regardless of X, is

(2.17.2)  $h(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$

The variates X and Y are independent if, and only if,

(2.17.3)  $f(x, y) = g(x)h(y)$

The distribution functions for X and Y, respectively, are

(2.17.4)  $G(x) = \int_{-\infty}^{x} g(u)\,du, \quad H(y) = \int_{-\infty}^{y} h(v)\,dv$

while the joint distribution function is

(2.17.5)  $F(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(u, v)\,du\,dv$

If the variates X and Y are discrete, there is no density function but the
distribution functions exist. The joint distribution function is

(2.17.6)  $F(x, y) = \sum_{x_i \leq x} \sum_{y_j \leq y} f(x_i, y_j)$.

Whether X and Y are continuous or discrete, the necessary and sufficient condition
for independence is

(2.17.7)  $F(x, y) = G(x)H(y)$.

* 2.18 The Distribution Function for a Sum of Two Independent Variates  Let
Z = X + Y and let P(z) be the distribution function for Z. By definition, P(z)
is the probability that X + Y < z. If we plot possible values of X and Y, using
rectangular coordinates in the plane, the region corresponding to X + Y < z is
all that part of the plane lying below and to the left of the line X + Y = z
(see Figure 15). For any given y, the probability that X < z - y is G(z - y),
so that the required probability is obtained by multiplying G(z - y) by the
probability for y and integrating over all y.

Fig. 15  Space of the variate Z = X + Y

(2.18.1)  $P(z) = \int_{-\infty}^{\infty} G(z - y)\,h(y)\,dy$

By differentiating with respect to z we obtain

(2.18.2)  $p(z) = \int_{-\infty}^{\infty} g(z - y)\,h(y)\,dy$

This is called the convolution of the density functions g and h. It is the
density function for Z.

Example 6  Let X have a uniform distribution on (0, 1) and Y a symmetrical
triangular distribution on (0, 2). Then Z has a distribution, which we wish to
find, on (0, 3), since z cannot take any values outside this interval.

The given density functions are

(2.18.3)  $g(x) = 1,\ 0 < x < 1$
          $h(y) = y,\ 0 < y < 1; \quad h(y) = 2 - y,\ 1 < y < 2$

Since g(z - y) is 1 for values of y between z - 1 and z, and is 0 for all other
values,

(2.18.4)  $p(z) = \int_{z-1}^{z} h(y)\,dy$

Now, from the definition of h(y), p(z) has different expressions for the three
intervals of z, namely, (0, 1), (1, 2) and (2, 3). In fact,

(2.18.5)  $p(z) = \int_0^{z} y\,dy = \tfrac{1}{2}z^2, \quad 0 < z < 1$
          $p(z) = \int_{z-1}^{1} y\,dy + \int_1^{z} (2 - y)\,dy = 3(z - \tfrac{1}{2}) - z^2, \quad 1 < z < 2$
          $p(z) = \int_{z-1}^{2} (2 - y)\,dy = \tfrac{1}{2}(3 - z)^2, \quad 2 < z < 3$

The graph of p(z) is formed of parts of three parabolas, joined together. It is
symmetrical about z = 1½.
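
The piecewise result (2.18.5) can be compared with a direct numerical evaluation of the convolution integral (2.18.2). The sketch below uses a simple midpoint rule; the step count and the function names are assumptions of the illustration, not part of the text.

```python
def g(x):
    """Uniform density on (0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0


def h(y):
    """Symmetrical triangular density on (0, 2)."""
    if 0.0 < y <= 1.0:
        return y
    if 1.0 < y < 2.0:
        return 2.0 - y
    return 0.0


def convolve_at(z, n=20000, lo=0.0, hi=2.0):
    """Midpoint-rule approximation to the integral of g(z - y) h(y) dy.

    The range (lo, hi) covers the whole support of h, so nothing is lost.
    """
    step = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * step
        total += g(z - y) * h(y)
    return total * step


def p_closed_form(z):
    """The three parabolic pieces of Eq. (2.18.5)."""
    if 0.0 < z <= 1.0:
        return z * z / 2
    if 1.0 < z <= 2.0:
        return 3 * (z - 0.5) - z * z
    if 2.0 < z < 3.0:
        return (3 - z) ** 2 / 2
    return 0.0
```

For instance, `convolve_at(1.5)` should be close to `p_closed_form(1.5)`, which is 0.75, the maximum of the density.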

2.19 Joint Distribution of k Variates  The notation and definitions of § 2.17
may be extended to three or more variates. The variates $X_1, X_2, \ldots, X_k$ are
independent if and only if the joint distribution function is equal to the product
of the separate distribution functions,

(2.19.1)  $F(x_1, x_2, \ldots, x_k) = F_1(x_1) F_2(x_2) \cdots F_k(x_k)$

If the variates are continuous, possessing density functions $f_1(x_1), f_2(x_2), \ldots,
f_k(x_k)$, the joint density function is

(2.19.2)  $f(x_1, x_2, \ldots, x_k) = f_1(x_1) f_2(x_2) \cdots f_k(x_k)$

As we shall see later, this relation is very useful in the theory of sampling. A
sample of k items is selected at random from a large population characterized
by a given or assumed distribution function. Each individual item may be
regarded as an independent choice from this distribution, and the probability
density of the sample actually selected is the product of the probability densities
for the separate items. Since the various items all have the same density function
f(x), the joint probability density for the sample with values $x_1, x_2, \ldots, x_k$ is

(2.19.3)  $f(x_1, x_2, \ldots, x_k) = f(x_1) f(x_2) \cdots f(x_k)$
PROBLEMS

A  (§§2.1-2.8) 

1. Criticise the following statements (occasionally seen in examination answers):
"Median = N/2," "$Q_1$ = N/4."

2.  Construct  a  histogram  showing  the  age  distribution  of  deaths  of  infants  under 
one  month  from  the  following  table,  taken  from  an  official  publication  of  the  United 
States  Government : 

Age at Death                   Frequency
Under 1 day                       26,665
1 day                              8,364
2 days                             6,344
3 to 6 days                       12,375
1 week                            10,911
2 weeks                            7,717
3 weeks but under 1 month          6,212
Total                             78,588


Hint:  "1  day"  means  anything  over  one  day  but  under  two  days;  "3  to  6"  means  over 
three  but  under  seven,  and  so  on.  Take  the  month  as  30  days  long.  The  areas  of  the 
rectangles,  not  the  heights,  represent  the  corresponding  frequencies. 

3.  Construct  a  cumulative  frequency  table  and  a  cumulative  frequency  polygon  for 
the  data  in  Problem  2.  What  was  the  approximate  probability  (at  the  time  these  data 
were  collected)  that  an  infant  who  lived  for  less  than  a  month  died  in  the  first  week? 

4.  The  following  table  gives  the  results  of  280  tests  made  on  a  certain  kind  of  coal 
for  ash  content : 

Percentage Ash        Frequency
 3.0- 3.9                  1
 4.0- 4.9                  7
 5.0- 5.9                 28
 6.0- 6.9                 78
 7.0- 7.9                 84
 8.0- 8.9                 45
 9.0- 9.9                 28
10.0-10.9                  7
11.0-11.9                  2
Total                    280

Calculate the median and the first and third quartiles. Find the percentile rank of an
ash content of 8.5%. Hint: Form a percentage cumulative frequency table,
corresponding to the values of $x_e$. The percentile rank of 8.5 is the corresponding percentage
cumulative frequency.

5. Calculate $D_9$ and $P_{10}$ for the data of Problem 2; also the percentile rank of 10
days. State in words what each of these statistics means.

6. For a set of 15 ungrouped sample measurements we find $\sum x = 480$, $\sum x^2 =
15{,}735$. Find the mean and standard deviation of X.

7. For a sample of size 2, show that $k_2 = (x_1 - x_2)^2/2$, and $s\,(= k_2^{1/2}) = |x_1 - x_2|/\sqrt{2}$.

8. Calculate the first three moments about zero for the distribution of Question 4.
Then obtain the mean, variance ($k_2$), the standard deviation ($k_2^{1/2}$) and the moment
measure of skewness ($m_3/m_2^{3/2}$) for this distribution. Hint: Use the coded variable
u = x - 7.45.

9. Prove that if we have a set of $N_1$ values $x_{1i}$ ($i = 1, 2, \ldots, N_1$) and another set of
$N_2$ values $x_{2j}$ ($j = 1, 2, \ldots, N_2$) of the variate X, then the mean of the combined set $\bar{x}$ is
given in terms of the two separate means $\bar{x}_1$ and $\bar{x}_2$ by the relation
$(N_1 + N_2)\,\bar{x} = N_1\bar{x}_1 + N_2\bar{x}_2$.

B  (§§  2.9-2.13) 

1. A variate X has the density function $f(x) = c(12x + x^2 - x^3)$, 0 < x < 4,
f(x) = 0 for all other values of x. Find c and calculate the mean, standard deviation and
skewness for this distribution. Sketch the curve for f(x).

2. A variate X has the density function f(x) = x/2 (0 < x < 1), f(x) =
1/2 (1 < x < 2) and f(x) = (3 - x)/2 (2 < x < 3). Find the variance and the moment
measure of kurtosis of X. Hint: This distribution is symmetrical about x = 3/2.
Calculate moments for u = x - 1½.

3. A continuous variate has the density function $f(x) = Cx^{1/2}(1 - x)^{3/2}$, 0 < x < 1.
Show that this function vanishes with infinite slope at x = 0, vanishes with zero slope
at x = 1 and has a maximum at x = 1/4. Sketch the curve. Calculate C and find
the mean and variance. Hint: Put $x = \sin^2\theta$ and use a reduction formula for integration.

4. The density function for X is $f(x) = cx^2e^{-x}$ (x > 0). Calculate c and also the
mean and variance of X. Find the cumulant generating function for X.

5.  Find  the  moment  generating  function  for  the  triangular  distribution  of  Example 
2,  §  2.9. 

6. A distribution has the m.g.f. $M(h) = (q + pe^h)^N$, where p and q are constants
and p + q = 1. Find the c.g.f. and the first four cumulants.

7. From a point on the circumference of a circle of radius a, a chord is drawn in
a random direction. Show that the expected value of the length of the chord is $4a/\pi$
and that the variance of the length is $2a^2(1 - 8/\pi^2)$. Hint: See Example 8, § 1.16.
Take the density $f(\theta)$ as $1/\pi$.

8. If a variate can take any value from 0 to 1 with equal probability, show that its
standard deviation is $\sqrt{3}/6 = 0.289$. A set of two-digit random numbers such as that
in Appendix B.1 may be regarded as giving approximate random choices from the
interval (0, 1), number 43 for instance being read as 0.43. Use the result of Problem
A-7 to obtain an estimate of s from 50 samples, each of size two, taken from Table B.1
and compare with the theoretical value.

9. The Cauchy distribution is defined by the density function
$f(x) = \dfrac{1}{\pi} \cdot \dfrac{a}{(x - b)^2 + a^2}$, $-\infty < x < \infty$, a > 0.
Show that the mean and variance of this distribution do not
exist, but that the mean is b if the improper integral defining it is interpreted as the
Cauchy principal value (see Appendix A.3).

10. The Pareto distribution is defined by $f(x) = \dfrac{\alpha}{\beta}\left(\dfrac{\beta}{x}\right)^{\alpha+1}$ ($x > \beta$), and f(x) = 0
otherwise. Show that the rth moment exists only if $\alpha > r$. Find the expectation and
variance of x if $\alpha > 2$.

11. Find the median and the mode of the distribution with density $f(x) = abx^{a-1}(1 + bx^a)^{-2}$,
b > 0, a > 1, $0 < x < \infty$. Hint: The median is that value x for which
$\int_0^{x} f(x)\,dx = \int_x^{\infty} f(x)\,dx$. The mode is that value x for which f(x) is a maximum.

12. Prove that the characteristic function of the Laplace distribution, with density
$f(x) = \tfrac{1}{2}e^{-|x|}$ ($-\infty < x < \infty$) is $C(h) = (1 + h^2)^{-1}$. Calculate the variance of this
distribution.

13. A discrete variate X has a distribution in the population defined by $f(x) =
\theta(1 - \theta)^x$, for x = 0, 1, 2, ..., where $\theta$ is a parameter with a value between 0 and 1.
Calculate the probability of a sample of N, in which $N_0$ have x = 0, $N_1$ have x = 1,
etc. and $\sum N_i = N$.
Find for what value of $\theta$ this probability is a maximum. (The value so defined is
called a maximum likelihood estimator of $\theta$. See further in § 6.1.) Hint: The sample
mean $\bar{x} = (N_1 + 2N_2 + 3N_3 + \cdots)/N$.

14. Prove the statement (2.12.7). Hint: Since the variates $X_j$ are independent, the
joint density function is the product of the separate density functions. Therefore,

$M_L(h) = \int \cdots \int e^{hL} f(x_1, x_2, \ldots, x_N)\,dx_1 \cdots dx_N$
$= \int \exp[hc_1x_1]\,f_1(x_1)\,dx_1 \int \exp[hc_2x_2]\,f_2(x_2)\,dx_2 \cdots \int \exp[hc_Nx_N]\,f_N(x_N)\,dx_N$

where $f_j(x_j)$ is the density function for $X_j$.

C  (§§  2.14-2.19) 

1. If X is the number of spots showing up in a single throw with a good die, show
that the Markov inequality gives $P(X \geq 5) \leq 0.7$ and the Chebyshev inequality gives
$P(|X - 3.5| \geq 2) \leq 35/48$. What are the true values of these probabilities?

2. A discrete variate X can take only the values x = 1, 2, 3, ... with probability
$2^{-x}$ (this is a geometric distribution). Prove that Chebyshev's inequality gives
$P(|X - 2| \geq 2) \leq \tfrac{1}{2}$. What is the true probability? Hint: $1 + 4r + 9r^2 + 16r^3 +
\cdots = (1 + r)/(1 - r)^3$, for r < 1.

3. For the two distributions with density functions

(a) f(x) = 1, (0 < x < 1)
(b) $f(x) = e^{-x}$ (x > 0)

calculate $P(|x - \mu| \geq 2\sigma)$ and compare with the value given by Chebyshev's inequality.

4. If X and Y are independent random variables with density functions $f(x) =
C_1x^me^{-x/2}$, $g(y) = C_2y^ne^{-y/2}$ respectively, show that the density function of W = X +
Y is $h(w) = C_3w^{m+n+1}e^{-w/2}$. Hint: See Appendix A.6.

5. If X and Y are variates with joint density function f(x, y) and if U = Y/X,
use the method of § 2.18 to find the density function for U. Show that
$h(u) = \int_{-\infty}^{\infty} |x|\,f(x, ux)\,dx$. Hint: Draw the line Y = uX in the X-Y plane, and show the areas
corresponding to U < u. Note that U < u implies Y < uX if X > 0 but Y > uX if X < 0.
Find the distribution function for U and differentiate to get h(u).

6. In Problem C-5 above, suppose that X and Y are independently and uniformly
distributed on the interval (0, 1), so that f(x, y) = 1 everywhere inside a unit square
and f(x, y) = 0 outside. Prove that

h(u) = 0,           u < 0
h(u) = 1/2,         0 < u < 1
h(u) = $1/(2u^2)$,    u > 1
Chapter 3

THE  BINOMIAL,  POISSON  AND  NORMAL 
DISTRIBUTIONS 

3.1 The Binomial (or Bernoulli) Distribution  Suppose that all the individuals
in a population are divided in imagination into two sets according as they have,
or do not have, a certain attribute A. Such a division is called a "dichotomy" (a
cutting in two); every individual belongs either to the one set or to the other.
The attribute A is often conventionally called a "success"; it may, for example,
be "head" in a population of coin-tosses or "male" in a population of children.
We assume that there is a definite probability $\theta$ that an individual chosen at
random from the population has the attribute A. This probability may be
estimated by taking a sample of N individuals and noting the number X which
are A's. The ratio X/N is called the relative frequency of success, and will be
denoted by p. It is, of course, a random variable and, as an estimate of $\theta$, is more
reliable the larger the sample size.*

The binomial distribution is concerned with the variation of X or p among
samples of size N from a population characterized by the parameter $\theta$. It is
assumed that the probability of success is unchanged by the process of selecting
an individual for the sample, so that we must assume either that the population
is infinite or, if it is finite, that the sampling is done "with replacements" (see
§ 1.12). Furthermore, each item for the sample is supposed to be chosen
independently of all the rest.

Under these conditions the probability that the first x individuals selected
will all be A's is $\theta^x$ and the probability that the next N - x will all be not-A's
is $(1 - \theta)^{N-x}$. The probability of a set of x A's followed by a set of N - x
not-A's is $\theta^x(1 - \theta)^{N-x}$, and this is also the probability for any other
preselected arrangement of x A's and N - x not-A's. However, we are not
interested in the precise arrangement of A's and not-A's, merely in the total number
of A's in the sample. Hence we can combine together the probabilities for all the
$\binom{N}{x}$ permutations of x successes and N - x failures, and state that the
probability of x successes, no matter in what order successes and failures occur, is
given by

(3.1.1)  $b(x, N, \theta) = \binom{N}{x} \theta^x (1 - \theta)^{N-x}$

where x = 0, 1, 2, ..., N. This is the binomial distribution, discussed by James
Bernoulli (1654-1705) in a book published in 1713, after his death. It gives the
probability that X has a value exactly x. It is called binomial because, if we
write $1 - \theta$ as $\phi$, $b(x, N, \theta)$ is the term containing $\theta^x$ in the expansion of the
binomial $(\phi + \theta)^N$. The probability that X = x is of course the same as the
probability that p = x/N.

* To stick to our Greek and Latin convention, the symbol for the probability should be
$\pi$ instead of $\theta$, to correspond with the sample statistic p. But the risk that $\pi$ may be
misinterpreted as 3.14159 ... is serious.

Table 3.1

  x     512 b(x, 9, 1/2)     512 B(x, 9, 1/2)
  0            1                   512
  1            9                   511
  2           36                   502
  3           84                   466
  4          126                   382
  5          126                   256
  6           84                   130
  7           36                    46
  8            9                    10
  9            1                     1
Total        512

Example 1  For a good coin we may take $\theta = \tfrac{1}{2}$, so that the probability of x
heads in nine tosses will be

$b(x, 9, \tfrac{1}{2}) = \dfrac{9!}{x!\,(9 - x)!} \left(\tfrac{1}{2}\right)^9, \quad x = 0, 1, 2, \ldots, 9$

By giving x all possible values we obtain Table 3.1, in which the probabilities
have been multiplied by $2^9 = 512$ so as to avoid fractions.

The cumulative binomial probability is usually defined as

(3.1.2)  $B(x, N, \theta) = \sum_{u=x}^{N} b(u, N, \theta)$

It is the probability of at least x successes. Values for N = 9 and
$\theta = \tfrac{1}{2}$ are given in Table 3.1. The probability of at least six heads in
nine tosses, for example, is 130/512 = 0.254. Note that the distribution
function, as defined in § 1.16, is $1 - B(x + 1, N, \theta)$.

The binomial distribution, even though X is discrete, may be represented
by a histogram in which rectangles of unit base, centered at x =
0, 1, ..., N, are drawn with heights equal to $b(x, N, \theta)$. The histogram for
Table 3.1 is shown in Figure 16. Since all the observations are actually at the
centers of the intervals, there is no grouping error (see § 2.7).

Fig. 16  Binomial distribution, $\theta = 0.5$ (heights $b(x, 9, \tfrac{1}{2})$ for x = 0, 1, ..., 9)
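
The entries of Table 3.1 are easy to regenerate. The sketch below uses exact fractions and the standard-library function `math.comb`; the function names `binom_pmf` and `binom_upper_tail` are introduced here for illustration only.

```python
from fractions import Fraction
from math import comb


def binom_pmf(x, N, theta):
    """b(x, N, theta) of Eq. (3.1.1)."""
    return comb(N, x) * theta**x * (1 - theta) ** (N - x)


def binom_upper_tail(x, N, theta):
    """B(x, N, theta) of Eq. (3.1.2): probability of at least x successes."""
    return sum(binom_pmf(u, N, theta) for u in range(x, N + 1))


N, theta = 9, Fraction(1, 2)
# Multiply by 2^9 = 512, as in Table 3.1, so that every entry is an integer.
scaled_b = [int(512 * binom_pmf(x, N, theta)) for x in range(N + 1)]
scaled_B = [int(512 * binom_upper_tail(x, N, theta)) for x in range(N + 1)]

for x in range(N + 1):
    print(f"{x:2d} {scaled_b[x]:5d} {scaled_B[x]:5d}")
```

The second column is simply the binomial coefficients for N = 9, and the third column is their running sum taken from the bottom of the table upward.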

Numerical values of $b(x, N, \theta)$ for given x, N and $\theta$ may be calculated by
means of logarithms of factorials. Seven-figure tables of log n! for n = 1 to
1000 are given in Glover's Tables [1] and Biometrika Tables [2]. Extensive tables
of the cumulative binomial distribution are now available (see [3] and [4]).
Separate terms of $b(x, N, \theta)$ may be obtained from these tables, if desired, by
differencing successive entries, since $b(x, N, \theta) = B(x, N, \theta) - B(x + 1, N, \theta)$,
but for most practical purposes the cumulative probabilities are more useful.
It is not necessary to tabulate values for $\theta$ beyond 0.5, since

(3.1.3)  $B(x, N, 1 - \theta) = 1 - B(N - x + 1, N, \theta)$

3.2 Recursion Formula for Binomial Probabilities  From Eq. (3.1.1) we see,
by cancelling common factors, that

(3.2.1)  $\dfrac{b(x, N, \theta)}{b(x - 1, N, \theta)} = \dfrac{N - x + 1}{x} \cdot \dfrac{\theta}{1 - \theta} = 1 - \dfrac{x - (N + 1)\theta}{x(1 - \theta)}$

The ratio of $b(x, N, \theta)$ to $b(x - 1, N, \theta)$ is therefore greater than, or less than, 1,
according as $x < (N + 1)\theta$ or $x > (N + 1)\theta$. The values of b increase with x
as long as x is below $(N + 1)\theta$ and decrease with x when x is above $(N + 1)\theta$.
If $x = (N + 1)\theta$, the probabilities that X = x and that X = x - 1 are equal.
This is the case for Example 1 when x = 5. In general the probability is a
maximum when X is equal to the integer next below $(N + 1)\theta$.
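
The ratio in (3.2.1) gives a convenient way of building up all the binomial probabilities from b(0, N, θ) without computing any factorials. The following is a sketch under the assumption that 0 < θ < 1 and that θ is supplied as an exact fraction; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def binom_pmf_recursive(N, theta):
    """All b(x, N, theta), x = 0..N, built from b(0) with the ratio in Eq. (3.2.1).

    Assumes 0 < theta < 1 so that the ratio is well defined.
    """
    probs = [(1 - theta) ** N]
    for x in range(1, N + 1):
        ratio = Fraction(N - x + 1, x) * theta / (1 - theta)
        probs.append(probs[-1] * ratio)
    return probs


def modes(probs):
    """Values of x at which the probability is largest (two values on a tie)."""
    top = max(probs)
    return [x for x, p in enumerate(probs) if p == top]


coin = binom_pmf_recursive(9, Fraction(1, 2))       # (N + 1) * theta = 5
skewed = binom_pmf_recursive(10, Fraction(3, 10))   # (N + 1) * theta = 3.3
```

For the coin of Example 1, `modes(coin)` returns both 4 and 5, the tie described above; for the second case the single mode is 3, the integer next below 3.3.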

3.3 Moments and Cumulants for the Binomial Distribution  In calculating
moments, etc., for the binomial distribution, it is convenient to use the concept
of indicator function, defined in § 1.13. If the event $A_j$ is that of selecting the jth
item for the sample, and if $I_{A_j} = 1$ when $A_j$ is a success and $I_{A_j} = 0$ when $A_j$
is a failure, the number of successes X is simply $\sum_{j=1}^{N} I_{A_j}$. By Theorem 1.15,
$E(I_{A_j}) = P(A_j)$, which is $\theta$ for each item. Therefore,

(3.3.1)  $\mu = E(X) = \sum_j E(I_{A_j}) = \sum_j P(A_j) = N\theta$

which gives the mean of the binomial distribution. The variance $\sigma^2$ is similarly
obtainable from Bienaymé's Theorem (§ 2.14), according to which

(3.3.2)  $\sigma^2 = V(X) = \sum_j V(I_{A_j})$

where

$V(I_{A_j}) = E(I_{A_j} - \theta)^2 = E(I_{A_j}^2) - 2\theta E(I_{A_j}) + \theta^2$

But $I_{A_j}^2$ takes exactly the same values as $I_{A_j}$, namely, 0 and 1, so that $E(I_{A_j}^2)
= E(I_{A_j}) = \theta$. Therefore, $V(I_{A_j}) = \theta - \theta^2$, and from equation (2),

(3.3.3)  $\sigma^2 = \sum_j (\theta - \theta^2) = N\theta(1 - \theta)$

which is the required variance. Since p differs from X only by the constant
factor 1/N, it follows that

(3.3.4)  $E(p) = \dfrac{1}{N} E(X) = \theta$

(3.3.5)  $V(p) = \dfrac{1}{N^2} V(X) = \dfrac{\theta(1 - \theta)}{N}$

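The moment generating function for $I_{A_j}$ is

$M_j(h) = E(\exp[hI_{A_j}]) = \sum \exp[hI_{A_j}]\,P(I_{A_j}) = e^0 \cdot (1 - \theta) + e^h \cdot \theta$

since $I_{A_j}$ is either 0 or 1, with probabilities $1 - \theta$ and $\theta$ respectively. Therefore,

$M_j(h) = 1 - \theta + \theta e^h$

The m.g.f. for X is therefore given (see Eq. (2.12.7)) by

(3.3.6)  $M(h) = [M_j(h)]^N = (1 - \theta + \theta e^h)^N$

The cumulant generating function is

(3.3.7)  $K(h) = \log M(h) = N \log(1 - \theta + \theta e^h) = N \log\left(1 + \theta h + \dfrac{\theta h^2}{2!} + \dfrac{\theta h^3}{3!} + \cdots\right)$

If the logarithm is expanded in a series of powers of h, the first four successive
cumulants are found to be

(3.3.8)  $\kappa_1 = N\theta = \mu$
         $\kappa_2 = N\theta(1 - \theta) = \sigma^2$
         $\kappa_3 = N\theta(1 - \theta)(1 - 2\theta) = \sigma^2(1 - 2\theta)$
         $\kappa_4 = N\theta(1 - \theta)(1 - 6\theta + 6\theta^2) = \sigma^2(1 - 6\theta + 6\theta^2)$

The skewness is $\kappa_3/\kappa_2^{3/2} = (1 - 2\theta)/\sigma$ and therefore is zero only when
$\theta = \tfrac{1}{2}$. For small values of $\theta$ the distribution has a positive skewness, which
diminishes, however, as N increases, since $\sigma$ is proportional to $N^{1/2}$.
The kurtosis is

(3.3.9)  $\dfrac{\kappa_4}{\kappa_2^2} = \dfrac{1 - 6\theta + 6\theta^2}{\sigma^2} = \dfrac{1}{\sigma^2} - \dfrac{6}{N}$

If $\theta = 1/2$, $\sigma^2 = N/4$, and $\kappa_4/\kappa_2^2 = -2/N$.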
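
The cumulants in (3.3.8) can be checked against moments computed directly from the binomial probabilities. The sketch below relies on the standard relations $\kappa_2 = \mu_2$, $\kappa_3 = \mu_3$ and $\kappa_4 = \mu_4 - 3\mu_2^2$ between cumulants and moments about the mean; the values N = 12 and θ = 1/3 are arbitrary choices for the illustration.

```python
from fractions import Fraction
from math import comb


def binom_pmf(x, N, theta):
    """b(x, N, theta) of Eq. (3.1.1)."""
    return comb(N, x) * theta**x * (1 - theta) ** (N - x)


def central_moment(N, theta, r):
    """r-th moment about the mean N * theta, summed directly over x = 0..N."""
    mu = N * theta
    return sum(binom_pmf(x, N, theta) * (x - mu) ** r for x in range(N + 1))


N, theta = 12, Fraction(1, 3)   # illustrative values
mu2, mu3, mu4 = (central_moment(N, theta, r) for r in (2, 3, 4))

# Standard relations between the low-order cumulants and central moments.
k2, k3, k4 = mu2, mu3, mu4 - 3 * mu2**2

sigma2 = N * theta * (1 - theta)
skewness_numerator = sigma2 * (1 - 2 * theta)            # kappa_3 in (3.3.8)
kurtosis_numerator = sigma2 * (1 - 6 * theta + 6 * theta**2)   # kappa_4 in (3.3.8)
```

Because exact fractions are used, the comparison is an identity rather than an approximation; for these values $\kappa_4$ is negative, so the distribution is slightly flatter than the normal.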
By differentiating with respect to $\theta$ the expression for the rth moment $\mu_r$,
namely,

(3.3.10)  $\mu_r = \sum_{x=0}^{N} \binom{N}{x} \theta^x (1 - \theta)^{N-x} (x - N\theta)^r$

we may obtain a recursion formula for the moments, namely,

(3.3.11)  $\mu_{r+1} = \theta(1 - \theta)\left[N r \mu_{r-1} + \dfrac{d\mu_r}{d\theta}\right]$

From this formula, starting with $\mu_0 = 1$ and $\mu_1 = 0$, all subsequent binomial
moments may be calculated in turn by giving r the values 1, 2, 3, ... .
A still simpler recursion formula is that for the cumulants,

(3.3.12)  $\kappa_{r+1} = \theta(1 - \theta)\dfrac{d\kappa_r}{d\theta}, \quad r \geq 1$

from which, starting with $\kappa_1 = N\theta$, the higher cumulants may readily be
obtained. (See hint to Problem A-7.)

3.4 The Bernoulli Law of Large Numbers  Let $p_N$ be the relative frequency
of success in N independent trials, the probability of success $\theta$ being the same
in all trials. Since $E(p_N) = \theta$, we have by Chebyshev's inequality, § 2.16:

(3.4.1)  $P(|p_N - \theta| \geq \lambda) \leq \dfrac{\sigma^2}{\lambda^2}$

Here $\sigma^2 = \theta(1 - \theta)/N$, and since, for all $\theta$ between 0 and 1, $\theta(1 - \theta) \leq 1/4$,
Eq. (1) becomes

(3.4.2)  $P(|p_N - \theta| \geq \lambda) \leq 1/(4N\lambda^2)$

For any fixed $\lambda > 0$ and any given $\epsilon > 0$, we can always take N so large that
$1/(4N\lambda^2) < \epsilon$. This means that for large enough N the probability that $p_N$
differs from $\theta$ by any fixed amount, however small, can be made as near to zero
as we like. This is sometimes expressed as "$p_N$ converges in probability to the
value $\theta$." Note that this is not the same thing as ordinary mathematical
convergence (see § 1.2). The law expressed by equation (2) is a form of the weak law
of large numbers.

The number N given by this equation may be quite large when $\lambda$ and $\epsilon$ are
small. Thus if $\lambda = 0.01$ and $\epsilon = 0.001$, we find that N > 2,500,000. As we shall
see in § 3.12, the approximation of the binomial distribution by the normal
distribution permits us to find a much smaller N satisfying the requirements.
The value above is certainly sufficient, but not necessary.
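
The sample size demanded by (3.4.2) follows from solving $1/(4N\lambda^2) \leq \epsilon$ for N. The sketch below does this with exact fractions to avoid rounding at the boundary; the function name is an assumption of the illustration, and the arguments are passed as decimal strings so that they convert exactly.

```python
from fractions import Fraction
from math import ceil


def chebyshev_sample_size(lam, eps):
    """Smallest N with 1 / (4 N lam^2) <= eps, from Eq. (3.4.2)."""
    lam, eps = Fraction(lam), Fraction(eps)
    return ceil(1 / (4 * lam**2 * eps))


n_required = chebyshev_sample_size("0.01", "0.001")
```

With $\lambda = 0.01$ and $\epsilon = 0.001$ the boundary value is 2,500,000; any larger N makes the bound strictly less than $\epsilon$, in agreement with the figure quoted above.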

*  3.5  Non-Bernoulli Sampling  The two chief variants from the true
Bernoulli (binomial) sampling scheme described above are (1) the Poisson scheme
in which the probability of success $\theta_j$ at the $j$th trial varies from one trial to
another, and (2) the hypergeometric scheme (sampling without replacement
from a finite population) in which the probability at any stage depends on the
results of the previous trials.

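For the Poisson scheme, we have instead of (3.3.1),

(3.5.1)  $\mu = E(X) = \sum_j P(A_j) = \sum_j \theta_j$

Also,

(3.5.2)  $\sigma^2 = \sum_j V(I_{A_j}) = \sum_j (\theta_j - \theta_j^2)$

If

$\bar\theta = \frac{1}{N} \sum_j \theta_j$

and if

$\sigma_\theta^2 = \frac{1}{N} \sum_j (\theta_j - \bar\theta)^2 = \frac{1}{N} \sum_j \theta_j^2 - \bar\theta^2$

we have

(3.5.3)  $\sigma^2 = N\bar\theta - N(\sigma_\theta^2 + \bar\theta^2)$

         $= N\bar\theta(1 - \bar\theta) - N\sigma_\theta^2$

Here $\bar\theta$ is the mean of the $\theta_j$, and $\sigma_\theta^2$ is their mean square difference from the
mean. This shows that in the Poisson scheme the variance of $X$ is less than it
would be if the probability of success were constant over all trials.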
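
As a check on (3.5.3), the following sketch computes the variance both directly from (3.5.2) and through the mean and spread of the $\theta_j$. The list of probabilities is a made-up illustration, not data from the text.

```python
def poisson_scheme_variance(thetas):
    """Return the variance of X computed from Eq. (3.5.2) and from Eq. (3.5.3)."""
    n = len(thetas)
    direct = sum(t * (1 - t) for t in thetas)              # Eq. (3.5.2)
    mean = sum(thetas) / n                                 # theta-bar
    spread = sum((t - mean) ** 2 for t in thetas) / n      # sigma_theta^2
    via_mean = n * mean * (1 - mean) - n * spread          # Eq. (3.5.3)
    return direct, via_mean


thetas = [0.1, 0.2, 0.3, 0.4]   # hypothetical success probabilities
direct, via_mean = poisson_scheme_variance(thetas)
binomial_var = len(thetas) * 0.25 * 0.75   # constant theta equal to the mean 0.25
print(direct, via_mean, binomial_var)
```

The two computed variances agree, and both fall below the binomial value obtained with the constant probability $\bar\theta$.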

In the hypergeometric scheme, suppose we have a finite population of size
$M$, in which the number of "successes" is $S$ and the number of "failures" is $F$,
with $S + F = M$. The probability of success at the first trial is $\theta = S/M$. The
total number of possible different samples of size $N$ is $\binom{M}{N}$. The number
containing $x$ successes and $N - x$ failures is $\binom{S}{x}\binom{F}{N - x}$, so that the probability
of exactly $x$ successes in a sample of size $N$ is

(3.5.4)  $h(x, N, M, S) = \binom{S}{x}\binom{F}{N - x} \Big/ \binom{M}{N}$

This may be written

(3.5.5)  $h(x, N, M, S) = \frac{S!\,F!\,N!\,(M - N)!}{x!\,(S - x)!\,(N - x)!\,(F - N + x)!\,M!}$

When $M$ is very large and $\theta$ not too near 0 or 1 this approximates to
$b(x, N, \theta)$.

The expression on the right of (5), when multiplied by a constant, is equal
to the coefficient of $u^x$ in the series expansion of the hypergeometric function
$F(\alpha, \beta, \gamma, u)$ with $\alpha = -N$, $\beta = -S$, $\gamma = F - N + 1$. Hence the name of the
distribution [5]. In calculating the higher moments of the hypergeometric
distribution it is simplest to obtain first the factorial moments (see § 2.11). The first
two factorial moments are

(3.5.6)  $\mu'_{(1)} = \sum_{x=0}^{N} x\,h(x, N, M, S)$

         $\mu'_{(2)} = \sum_{x=0}^{N} x(x - 1)\,h(x, N, M, S)$


and from these we easily obtain

(3.5.7)  $\mu = \mu'_{(1)}, \qquad \sigma^2 = \mu'_{(2)} + \mu'_{(1)} - [\mu'_{(1)}]^2$


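From Eqs. (5) and (6),

(3.5.8)  $\mu'_{(1)} = \frac{S!\,F!\,N!\,(M - N)!}{M!} \sum_{x=1}^{N} \frac{1}{(x - 1)!\,(S - x)!\,(N - x)!\,(F - N + x)!}$

         $= \frac{S!\,F!\,N!\,(M - N)!}{M!} \sum_{x=0}^{N-1} \frac{1}{x!\,(S - 1 - x)!\,(N - 1 - x)!\,(F - N + 1 + x)!}$

         $= \frac{SN}{M} \sum_{x=0}^{N-1} \frac{(S - 1)!\,F!\,(N - 1)!\,(M - 1 - N + 1)!}{x!\,(S - 1 - x)!\,(N - 1 - x)!\,(F - N + 1 + x)!\,(M - 1)!}$

         $= \frac{SN}{M} \sum_{x=0}^{N-1} h(x, N - 1, M - 1, S - 1)$

         $= \frac{SN}{M} = N\theta$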


Similarly, we may calculate

(3.5.9)  $\mu'_{(2)} = \frac{S!\,F!\,N!\,(M - N)!}{M!} \sum_{x=2}^{N} \frac{1}{(x - 2)!\,(S - x)!\,(N - x)!\,(F - N + x)!}$

         $= \frac{S(S - 1)N(N - 1)}{M(M - 1)} \sum_{x=0}^{N-2} \frac{(S - 2)!\,F!\,(N - 2)!\,(M - 2 - N + 2)!}{x!\,(S - 2 - x)!\,(N - 2 - x)!\,(F - N + 2 + x)!\,(M - 2)!}$

         $= \frac{S(S - 1)N(N - 1)}{M(M - 1)} \sum_{x=0}^{N-2} h(x, N - 2, M - 2, S - 2)$

         $= \frac{N\theta(S - 1)(N - 1)}{M - 1}$

From Eq. (7), we have

(3.5.10)  $E(X) = \mu = N\theta = \frac{NS}{M}$

and

(3.5.11)  $V(X) = \sigma^2 = \mu'_{(2)} + N\theta(1 - N\theta)$

          $= N\theta\left[\frac{(S - 1)(N - 1)}{M - 1} + \frac{M - NS}{M}\right]$

          $= \frac{N\theta}{M(M - 1)}\,(M - S)(M - N)$

          $= N\theta(1 - \theta)\,\frac{M - N}{M - 1}$

The mean is therefore exactly the same as in pure binomial sampling, but the
variance is less because of the factor $(M - N)/(M - 1)$. For $M$ large compared
with $N$ this correcting factor is nearly 1.
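
The closed forms (3.5.10) and (3.5.11) can be checked by summing the probabilities (3.5.4) directly. The sketch below does this for one arbitrarily chosen population; the sizes 10, 50 and 20 are our own illustrative values.

```python
from math import comb


def hypergeom_pmf(x: int, n: int, m: int, s: int) -> float:
    """h(x, N, M, S) of Eq. (3.5.4), with F = M - S failures in the population."""
    return comb(s, x) * comb(m - s, n - x) / comb(m, n)


def hypergeom_mean_var(n: int, m: int, s: int):
    """Mean and variance of the number of successes, by direct summation."""
    xs = range(max(0, n - (m - s)), min(n, s) + 1)   # attainable values of x
    mean = sum(x * hypergeom_pmf(x, n, m, s) for x in xs)
    var = sum((x - mean) ** 2 * hypergeom_pmf(x, n, m, s) for x in xs)
    return mean, var


n, m, s = 10, 50, 20            # hypothetical sample and population sizes
theta = s / m
mean, var = hypergeom_mean_var(n, m, s)
print(mean, n * theta)                                       # Eq. (3.5.10)
print(var, n * theta * (1 - theta) * (m - n) / (m - 1))      # Eq. (3.5.11)
```

Each printed pair should agree to rounding error, and the variance is smaller than the binomial value $N\theta(1 - \theta)$ by the factor $(M - N)/(M - 1)$.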

3.6  The Poisson Distribution for Rare Events  If in a binomial distribution
the probability of success is very small (so that the event "success" may be said
to be rare) but the size of the sample $N$ is so large that the expected number of
successes in the sample is moderate (say between 0.1 and 10), the probability of
exactly $x$ successes is given approximately by

(3.6.1)  $p(x, \mu) = \mu^x e^{-\mu}/x!$

where $\mu = N\theta$. The theoretical distribution given exactly by Eq. (1), for all
integral values of $x$ from 0 onwards, is called the Poisson distribution.
The true (binomial) probability of $x$ successes is

(3.6.2)  $b(x, N, \theta) = \frac{N!}{x!\,(N - x)!}\,\theta^x (1 - \theta)^{N - x}$

         $= \frac{N(N - 1) \cdots (N - x + 1)}{x!} \left(\frac{\mu}{N}\right)^x \left(1 - \frac{\mu}{N}\right)^{N - x}$

         $= \frac{\mu^x}{x!} \left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right) \cdots \left(1 - \frac{x - 1}{N}\right)\left(1 - \frac{\mu}{N}\right)^{-x}\left(1 - \frac{\mu}{N}\right)^{N}$

where each of the $x$ factors $N, N - 1, \ldots, (N - x + 1)$ has been divided by one
of the factors of $N^x$.

In this equation we suppose that $x$ is a fixed number but that $N$ tends to
infinity and $\theta$ to zero in such a way that $N\theta$ tends to the fixed value $\mu$. All the
factors $1 - 1/N$, $1 - 2/N$, ..., $1 - (x - 1)/N$, $(1 - \mu/N)^{-x}$ tend to the value 1, but the
limit of $(1 - \mu/N)^N$ is the number $e^{-\mu}$ (see mathematical Appendix A.1). Hence

$\lim_{N \to \infty,\ \theta \to 0} b(x, N, \theta) = p(x, \mu)$
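
The limit can be watched numerically. In the sketch below $\mu$ is held at 2 and $x$ at 3 (values chosen only for illustration) while $N$ grows and $\theta = \mu/N$ shrinks.

```python
from math import comb, exp, factorial


def binom_pmf(x: int, n: int, theta: float) -> float:
    """b(x, N, theta): probability of exactly x successes in n trials."""
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)


def poisson_pmf(x: int, mu: float) -> float:
    """p(x, mu) of Eq. (3.6.1)."""
    return mu ** x * exp(-mu) / factorial(x)


mu, x = 2.0, 3                      # hypothetical fixed values
for n in (10, 100, 1000, 10000):
    print(n, binom_pmf(x, n, mu / n), poisson_pmf(x, mu))
```

The binomial column should approach the constant Poisson value as $N$ increases, in line with the limiting argument above.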

Instances  of  the  Poisson  distribution  might  be  the  number  of  blind  births 
per  year  in  a  large  city,  the  number  of  occurrences  of  hands  containing  four  aces 
in  an  evening  of  bridge  at  a  club,  or  the  number  of  typographical  errors  per  page 
in  professionally  typed  material.  The  numbers  of  births,  hands  of  bridge  or 
typed  symbols  will  be  large,  and  the  probability  of  the  rare  event  described 
(blind  birth,  hand  with  four  aces  or  error)  is  small,  but  there  may  well  be  a  few 
such  events  in  each  instance.  Considering  the  births,  for  example,  we  assume 
that these are independent, that the probability of a blind birth remains constant,
and that the total number of births per year in the region considered is
approximately  constant.  If  so,  the  annual  number  of  blind  births  in  the  region, 
as  recorded  over  a  period  of  several  years,  should  fluctuate  approximately  in 
accordance  with  a  Poisson  distribution. 

3.7  Moments and Cumulants of the Poisson Distribution  For the theoretical
distribution given by Eq. (3.6.1), $x$ may take all integral values 0, 1, 2, .... It
may be noted that

(3.7.1)  $\sum_{x=0}^{\infty} p(x, \mu) = e^{-\mu} \sum_{x=0}^{\infty} \frac{\mu^x}{x!} = 1$

as it should be, since

$\sum_{x=0}^{\infty} \frac{\mu^x}{x!} = e^{\mu}$

The expectation of $X$ is

(3.7.2)  $E(X) = \sum_{x=0}^{\infty} x\,p(x, \mu)$

         $= \mu \sum_{x=1}^{\infty} \frac{\mu^{x-1} e^{-\mu}}{(x - 1)!} = \mu e^{-\mu} e^{\mu} = \mu$

The moment and cumulant generating functions may be found from those
of the binomial (§ 3.3) by writing $\theta = \mu/N$ and letting $N$ tend to infinity. Thus

(3.7.3)  $M(h) = \lim_{N \to \infty} \left[1 + \frac{\mu(e^h - 1)}{N}\right]^N$

         $= \exp[\mu(e^h - 1)]$, by Appendix A.1

(3.7.4)  $K(h) = \log M(h) = \mu(e^h - 1)$

         $= \mu\left(h + \frac{h^2}{2!} + \frac{h^3}{3!} + \cdots\right)$

All the cumulants are therefore equal to $\mu$. In particular, the variance is $\mu$, the
skewness is $\kappa_3/\kappa_2^{3/2} = \mu \cdot \mu^{-3/2} = \mu^{-1/2}$, and the kurtosis is $\kappa_4/\kappa_2^2 = \mu^{-1}$.
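
These statements about the low-order cumulants can be confirmed by brute-force summation. The sketch below (our own helper, with the series truncated at 200 terms, which is ample for moderate $\mu$) returns the mean, variance and third central moment, each of which should equal $\mu$.

```python
from math import exp


def poisson_moments(mu: float, terms: int = 200):
    """Mean, variance and third central moment of a Poisson(mu) variate by summation."""
    probs = []
    p = exp(-mu)                    # p(0, mu)
    for x in range(terms):
        probs.append(p)
        p *= mu / (x + 1)           # p(x + 1, mu) obtained from p(x, mu)
    mean = sum(x * q for x, q in enumerate(probs))
    var = sum((x - mean) ** 2 * q for x, q in enumerate(probs))
    third = sum((x - mean) ** 3 * q for x, q in enumerate(probs))
    return mean, var, third


mean, var, third = poisson_moments(4.0)     # mu = 4 is an arbitrary illustration
skewness = third / var ** 1.5               # should be close to 4 ** -0.5
print(mean, var, third, skewness)
```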

3.8  Tables of the Poisson Function  Tables of $p(x, \mu)$ for values of $\mu$ between
0.1 and 15 may be found in Biometrika Tables [2]. More extensive tables have
been calculated by Molina and by Kitagawa [6]. The former has also tabulated
the cumulative probabilities:

(3.8.1)  $P(c, \mu) = \sum_{x=c}^{\infty} p(x, \mu)$

As an approximation to the cumulative binomial for moderately small $\theta$,
the cumulative Poisson function $P(x, \mu)$ is improved by subtracting a term
proportional to $p(x - 1, \mu)$. As shown by Gram and Charlier,

(3.8.2)  $B(x, N, \theta) \approx P(x, \mu) - \frac{1}{2}\theta(x - 1 - \mu)\,p(x - 1, \mu)$

Example 2  For $N = 10$, $\theta = 0.1$, and $x = 3$, we have $\mu = 1$, $B(3, 10, 0.1)$
$= 0.07019$, $P(3, 1) = 0.08030$. The correcting term is $-0.05\,p(2, 1) = -0.00920$,
which makes the approximation 0.07110, and so improves it considerably.
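
The figures of Example 2 can be reproduced with a short computation. The function names below are our own; `corrected_poisson` implements the right-hand side of (3.8.2).

```python
from math import comb, exp, factorial


def binom_upper(x: int, n: int, theta: float) -> float:
    """B(x, N, theta) = P(X >= x) for a binomial variate."""
    return sum(comb(n, k) * theta ** k * (1 - theta) ** (n - k) for k in range(x, n + 1))


def poisson_pmf(x: int, mu: float) -> float:
    return mu ** x * exp(-mu) / factorial(x)


def poisson_upper(c: int, mu: float) -> float:
    """P(c, mu) of Eq. (3.8.1), computed as one minus the lower tail."""
    return 1.0 - sum(poisson_pmf(k, mu) for k in range(c))


def corrected_poisson(x: int, n: int, theta: float) -> float:
    """Right-hand side of Eq. (3.8.2)."""
    mu = n * theta
    return poisson_upper(x, mu) - 0.5 * theta * (x - 1 - mu) * poisson_pmf(x - 1, mu)


print(binom_upper(3, 10, 0.1))        # exact binomial value
print(poisson_upper(3, 1.0))          # uncorrected Poisson value
print(corrected_poisson(3, 10, 0.1))  # Poisson value with the correcting term
```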

*  3.9  The Poisson Distribution of Random Events  The clicks heard in a
Geiger counter at a chosen location may be regarded as produced by independent
events: the passage of cosmic rays or particles from a radioactive source
through the counter. Also the result is practically independent of the precise
time of observation $t$ (assuming that the radioactive source is relatively
long-lived). Consider, again, the arrival of incoming calls at a telephone
switchboard. Except in special circumstances of national or local excitement, the
calls may be regarded as practically independent of one another. The hypothesis
that they are also independent of $t$ is more dubious (since there are slack times
during the day, holidays, etc.) but one five-minute period will probably be very
like another during the regular office hours, Monday to Friday.

These are examples of sequences of independent physical events, each of
which has a well-defined probability of occurring in an interval of time $\delta t$ (from
$t$ to $t + \delta t$), where this probability, although it depends of course on $\delta t$, may be
considered independent of $t$. On these assumptions it is a simple matter to show
that the distribution of the number of events $X$ occurring in a fixed interval $T$ is a
Poisson distribution. Even though the assumptions may not be fully justified,
the distribution seems in practice to be very nearly Poisson. Certainly telephone
engineers have found that calculations based on this distribution are very useful
in designing switchboards to accommodate expected telephone traffic.

The proof that the distribution is Poisson goes as follows. Let $p(t)$ be the
probability that exactly one event (such as a click) occurs in a time interval of
length $t$. Also let $q(t)$ be the probability that no such events occur and $r(t)$ the
probability that more than one event occurs, in the interval $t$. Since these three
possibilities are mutually exclusive and exhaustive,

(3.9.1)  $p(t) + q(t) + r(t) = 1$

It is reasonable to suppose that $q(0) = 1$, since no events will occur in an
interval of zero length. We assume also that $q'(t)$ tends to the value $-a$ ($a > 0$)
as $t \to 0$, where $q'(t) = dq(t)/dt$. This means that $q(t)$ decreases as $t$ increases
from zero. Furthermore, we will suppose that $r(t)/t \to 0$ as $t \to 0$, which
means that the probability of more than one event in the interval of length $t$ tends
to zero even more rapidly than $t$ itself. (If the probability of one event in a very
short interval is small, the probability of two or more events in the same short
interval will be of a higher order of smallness.) With these assumptions we can
show that the probability of exactly $x$ events in the interval $t$ is given by

(3.9.2)  $p(x, at) = (at)^x e^{-at}/x!$

which is the Poisson distribution with parameter $at$.

Let $X$ denote the variate "number of events (such as clicks) occurring in an
interval of length $t$" and let $n$ be any fixed positive integer. Subdivide the
interval into $n$ non-overlapping sub-intervals each of length $t/n$. Let $E$ be the
event "in exactly $x$ of these sub-intervals just one click occurs" and $F$ the event
"two or more clicks occur in at least one of the sub-intervals." Then if $E$ occurs
and not $F$, the value of $X$ will be $x$; and if the value of $X$ is $x$, either $E$ or $F$
must occur. That is,

(3.9.3)  $E \cap \bar{F} \subset (X = x) \subset E \cup F$

By Theorems 1.6 and 1.7,

(3.9.4)  $P(E \cap \bar{F}) \le P(X = x) \le P(E \cup F) \le P(E) + P(F)$

But

$P(E) = P(E \cap F) + P(E \cap \bar{F}) \le P(F) + P(E \cap \bar{F})$

so that

(3.9.5)  $P(E) - P(F) \le P(E \cap \bar{F})$

From (4) and (5), we obtain

(3.9.6)  $P(E) - P(F) \le P(X = x) \le P(E) + P(F)$

We will now show that $P(F) \to 0$ as $n \to \infty$, from which it follows that
$P(X = x) = \lim P(E)$. Let $F_i$ be the event "two or more clicks occur in the $i$th
sub-interval." Then $F = \bigcup_i F_i$ and

(3.9.7)  $P(F) = P\left(\bigcup_i F_i\right) \le \sum_i P(F_i)$

In the notation of Eq. (1), $P(F_i) = r(t/n)$, and since this is the same for each
sub-interval,

(3.9.8)  $\sum_i P(F_i) = n\,r\left(\frac{t}{n}\right) = t\,\frac{r(t/n)}{t/n}$

which, by the assumptions made above, tends to the value 0 as $n \to \infty$. Thus
from Eqs. (7) and (8) we see that $P(F) \to 0$, as stated.

Now $P(E)$ is the probability of exactly $x$ successes in $n$ independent trials of
an event (namely the occurrence of just one click in a sub-interval of length $t/n$)
of which the probability of success in a single trial is $p(t/n)$. By the binomial
theorem,

(3.9.9)  $P(E) = \binom{n}{x} p^x (1 - p)^{n - x}$

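where $p$ stands for $p(t/n)$. Also by the assumption regarding $q(t)$ and its
derivative,

(3.9.10)  $q'(0) = \lim_{t \to 0} \frac{q(t) - q(0)}{t} = \lim_{t \to 0} \frac{q(t) - 1}{t} = -a$

so that

(3.9.11)  $q(t) = 1 - at + t\,\varepsilon(t)$

where $\varepsilon(t) \to 0$ as $t \to 0$. Applying this result to the interval $t/n$, we have

(3.9.12)  $q\left(\frac{t}{n}\right) = 1 - \frac{at}{n} + \frac{t}{n}\,\varepsilon\left(\frac{t}{n}\right)$

By (1),

(3.9.13)  $p\left(\frac{t}{n}\right) = 1 - q\left(\frac{t}{n}\right) - r\left(\frac{t}{n}\right)$

          $= \frac{t}{n}\left[a - \varepsilon\left(\frac{t}{n}\right) - \frac{r(t/n)}{t/n}\right]$

          $= \frac{t}{n}\,b_n$

where $b_n = a - \varepsilon(t/n) - r(t/n)/(t/n)$, which tends to the value $a$ as $n \to \infty$.
Therefore $n\,p(t/n) \to at$ (a fixed, positive number) as $n \to \infty$. But this is just the
condition for the Poisson approximation to hold; therefore,

(3.9.14)  $P(E) \to e^{-at}\,\frac{(at)^x}{x!} = p(x, at)$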

Moreover, the probability that $X = x$ lies between $P(E) - P(F)$ and $P(E)
+ P(F)$. As $n \to \infty$ both these extremes tend to the value $p(x, at)$, and therefore
so does $P(X = x)$. This is the Poisson distribution for random events.

The quantity $a$ is the expected number of events (clicks) in unit time; it may
be estimated from the ratio $N/T$ where $N$ is the total number of clicks occurring
in a fairly long interval $T$.
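
The squeeze in (3.9.6) can be illustrated numerically. The sketch below assumes, purely for illustration, that the one-interval probabilities are those of an exact Poisson process with rate $a$, so that $p(h) = ah\,e^{-ah}$ and $r(h) = 1 - e^{-ah}(1 + ah)$; these forms are our assumption and are not part of the proof. It then evaluates $P(E)$ from (3.9.9) and the bound $n\,r(t/n)$ from (3.9.8).

```python
from math import comb, exp, factorial


def poisson_pmf(x: int, mu: float) -> float:
    return mu ** x * exp(-mu) / factorial(x)


def squeeze_bounds(x: int, a: float, t: float, n: int):
    """Lower and upper limits P(E) -/+ n r(t/n) for P(X = x), as in Eq. (3.9.6)."""
    h = t / n                                    # length of each sub-interval
    p_one = a * h * exp(-a * h)                  # assumed p(h): exactly one event
    r_many = 1.0 - exp(-a * h) * (1.0 + a * h)   # assumed r(h): two or more events
    p_e = comb(n, x) * p_one ** x * (1 - p_one) ** (n - x)   # Eq. (3.9.9)
    p_f_bound = n * r_many                       # Eq. (3.9.8), an upper bound on P(F)
    return p_e - p_f_bound, p_e + p_f_bound


a, t, x = 2.0, 1.0, 3                            # hypothetical rate, interval and count
for n in (10, 100, 1000):
    low, high = squeeze_bounds(x, a, t, n)
    print(n, low, poisson_pmf(x, a * t), high)
```

As $n$ grows the two limits should close in on the Poisson value $p(x, at)$ from either side.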

3.10  The Normal Distribution as an Approximation to the Binomial  If the
binomial probabilities $b(x, N, \theta)$ are plotted against $x$ for different values of $N$,
it will be found that, as $N$ increases, the histograms so drawn approximate more
and more closely to a symmetrical bell-shaped curve known variously as the
normal, or Gaussian, curve, or the curve of error. The normal distribution,
represented by this curve, plays a central part in statistical theory. Since the
range of the binomial variable $X$, and its expectation, both increase with $N$, the
histograms get wider and flatter and move further to the right as $N$ increases
(see Figure 17, which shows outlines of the histograms for $\theta$ = \, $N$ = 9, 16 and
25).

Fig. 17  Three binomial distributions with increasing N

To avoid this it is convenient to use instead of $X$ the standardized variate


(3.10.1)  $Z = (X - \mu)/\sigma$

which is the difference of $X$ from its expectation expressed in units of the standard
deviation, and at the same time to multiply the ordinates $b(x, N, \theta)$ by $\sigma$. Since
$\sigma$ is proportional to the square root of $N$, the effect of the change of scale is to
compress the histogram horizontally and extend it vertically, and the change
of origin keeps the center at $z = 0$ for all values of $N$. An outline of the
histogram for $N = 50$ is shown in Figure 18, along with the limiting normal curve.
Almost the whole of the distribution lies within about three standard deviations
on either side of the mean, between $z = \pm 3$. The approximation to the normal
curve is much better, for moderate values of $N$, when $\theta$ is near 0.5 than when it
is near 0 or 1.

The probability that $X = x$ is given by the binomial expression

$P(X = x) = b(x, N, \theta) = \frac{N!}{x!\,(N - x)!}\,\theta^x \phi^{N - x}$

where $\phi = 1 - \theta$. Taking logs (to base $e$), we have

(3.10.2)  $\log P = \log N! - \log x! - \log (N - x)! + x \log \theta + (N - x) \log \phi$

With the use of Stirling's approximation for the logarithm of $n!$ (see
Appendix A.2), namely,

(3.10.3)  $\log n! \approx (n + \frac{1}{2}) \log n - n + \frac{1}{2} \log(2\pi)$,

Eq. (2) becomes

(3.10.4)  $\log P \approx (N + \frac{1}{2}) \log N - (x + \frac{1}{2}) \log x$

          $- (N - x + \frac{1}{2}) \log(N - x) - \frac{1}{2} \log(2\pi) + x \log \theta + (N - x) \log \phi$
Fig. 18  Standardized binomial distribution and normal curve ($\theta = 0.5$)

If we now change to the standardized variate, putting $z = (x - \mu)/\sigma
= (x - N\theta)(N\theta\phi)^{-1/2}$, so that

(3.10.5)  $x = N\theta + (N\theta\phi)^{1/2} z$

and

(3.10.6)  $N - x = N\phi - (N\theta\phi)^{1/2} z$

we find, from Eq. (4),

(3.10.7)  $\log P \approx -\frac{1}{2}[\log N + \log \theta + \log \phi + \log(2\pi)]$

          $- [N\theta + \frac{1}{2} + (N\theta\phi)^{1/2} z] \log[1 + z(\phi/N\theta)^{1/2}]$
          $- [N\phi + \frac{1}{2} - (N\theta\phi)^{1/2} z] \log[1 - z(\theta/N\phi)^{1/2}]$

Expanding the logarithms in series and arranging the terms on the right of
Eq. (7) in powers of $N^{-1/2}$, we finally obtain after some manipulation

(3.10.8)  $\log P \approx -\frac{1}{2} \log(2\pi N\theta\phi) - \frac{z^2}{2}$

          $+ \frac{z}{2N^{1/2}} [(\theta/\phi)^{1/2} - (\phi/\theta)^{1/2}] + \frac{z^3}{6N^{1/2}} [(\phi^3/\theta)^{1/2} - (\theta^3/\phi)^{1/2}] + \cdots$

where the unwritten terms are of order $N^{-1}$. For large $N$ the terms of order
$N^{-1/2}$ may be neglected (and they vanish identically when $\theta = \phi = \frac{1}{2}$). If so,

$\log P \approx -\frac{1}{2} \log(2\pi N\theta\phi) - \frac{z^2}{2}$

so that

(3.10.9)  $P \approx \frac{1}{\sqrt{2\pi N\theta\phi}}\,e^{-z^2/2} = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x - \mu)^2/(2\sigma^2)}$

The limiting form for $\sigma P$ is therefore given by

(3.10.10)  $\lim_{N \to \infty} \sigma P = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}$

The function on the right is the standardized form of the normal distribution
and will usually be denoted by $\phi(z)$. It is tabulated in Appendix B (Table 2).
For more extensive tables see references [7] and [8].
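
The quality of the limit (3.10.10) at a finite $N$ can be judged by tabulating $\sigma\,b(x, N, \theta)$ beside $\phi(z)$. The sketch below does so for $N = 50$, $\theta = 0.5$, the case drawn in Figure 18; the helper name is our own.

```python
from math import comb, exp, pi, sqrt


def scaled_binomial_vs_normal(n: int, theta: float):
    """Rows (x, z, sigma * b(x, N, theta), phi(z)) for x = 0, ..., n."""
    mu = n * theta
    sigma = sqrt(n * theta * (1 - theta))
    rows = []
    for x in range(n + 1):
        z = (x - mu) / sigma                                      # Eq. (3.10.1)
        scaled = sigma * comb(n, x) * theta ** x * (1 - theta) ** (n - x)
        density = exp(-z * z / 2) / sqrt(2 * pi)                  # Eq. (3.10.10)
        rows.append((x, z, scaled, density))
    return rows


rows = scaled_binomial_vs_normal(50, 0.5)
worst = max(abs(scaled - density) for _, _, scaled, density in rows)
print(f"largest discrepancy for N = 50: {worst:.4f}")
```

Repeating the computation with $\theta$ near 0 or 1 should give a visibly larger discrepancy, as the text remarks.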

3.11  Approximation of the Cumulative Binomial by the Cumulative Normal
Distribution  The distribution function for the standardized normal distribution
is

(3.11.1)  $\Phi(z) = \int_{-\infty}^{z} \phi(u)\,du$

where the integral is improper but converges (see Appendix A.3). This function
represents the area under the standardized curve from $-\infty$ up to the given
value $z$. It is tabulated in Appendix B and in references [7] and [8], although some
of these tables give instead of $\Phi(z)$ the integral $\int_0^z \phi(u)\,du$, which merely differs
from $\Phi(z)$ by 0.5. (Because of the symmetry of $\phi(u)$ about $u = 0$, $\int_{-\infty}^{0} \phi(u)\,du$
is one-half $\int_{-\infty}^{\infty} \phi(u)\,du$, and this latter integral is equal to 1; see Appendix A.7.)

The binomial probability that $X \ge x$ is given exactly by $B(x, N, \theta)$. This is
approximately equal to the probability that $Z \ge z$, where $Z$ has a normal
distribution and $z = (x - \mu)/\sigma$. However, because the binomial distribution is
discrete while the normal distribution is continuous, a better approximation is
given by putting $z = (x - 1/2 - \mu)/\sigma$. In Figure 19, the probability $B(x, N, \theta)$
is given by the sum of the areas of the shaded rectangles, and the bases
of these rectangles extend from $x - 1/2$ on. It can be shown [9] that the error
involved is less than $0.140/\sigma$.

Various attempts have been made to give a better approximation to the
value of $z$ which is such that

(3.11.2)  $B(x, N, \theta) = 1 - \Phi(z)$

A simple approximation is that of Freeman and Tukey [10], namely,

(3.11.3)  $z \approx 2\left[\sqrt{x(1 - \theta)} - \sqrt{(N - x + 1)\theta}\right]$

and a more elaborate one is due to Camp and Paulson,

(3.11.4)  $z \approx \frac{a}{3 b^{1/2}}$

where

$a = 9 - x^{-1} - \left[9 - (N - x + 1)^{-1}\right] \left[\frac{(N - x + 1)\theta}{x(1 - \theta)}\right]^{1/3}$

$b = x^{-1} + (N - x + 1)^{-1} \left[\frac{(N - x + 1)\theta}{x(1 - \theta)}\right]^{2/3}$

Fig. 19  Cumulative binomial probability

Example 3  For $N = 35$, $\theta = 0.30$ and $x = 15$, the true value of $B(x, N, \theta)$
is 0.07307, and the true corresponding $z$ from Eq. (2) is 1.4533. The first
approximation, $(x - 1/2 - \mu)/\sigma$, is $(14.5 - 10.5)/2.711 = 1.4755$. The value given by
Eq. (3) is $2[\sqrt{10.5} - \sqrt{6.3}] = 1.4608$, and that given by Eq. (4) is
$(1.3829)/[3(0.10054)^{1/2}] = 1.4538$. This last one is extremely accurate.
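
The three approximations of Example 3 can be compared with the exact value by a short computation. The sketch below uses Python's `statistics.NormalDist` for $\Phi$ and its inverse; the function names, and the symbol `b` for the quantity under the square root in (3.11.4), are our own labels. It should reproduce the figures of the example to about three decimal places.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_upper(x: int, n: int, theta: float) -> float:
    """Exact B(x, N, theta) = P(X >= x)."""
    return sum(comb(n, k) * theta ** k * (1 - theta) ** (n - k) for k in range(x, n + 1))


def z_continuity(x: int, n: int, theta: float) -> float:
    """First approximation: (x - 1/2 - mu) / sigma."""
    return (x - 0.5 - n * theta) / sqrt(n * theta * (1 - theta))


def z_freeman_tukey(x: int, n: int, theta: float) -> float:
    """Eq. (3.11.3)."""
    return 2 * (sqrt(x * (1 - theta)) - sqrt((n - x + 1) * theta))


def z_camp_paulson(x: int, n: int, theta: float) -> float:
    """Eq. (3.11.4)."""
    ratio = (n - x + 1) * theta / (x * (1 - theta))
    a = 9 - 1 / x - (9 - 1 / (n - x + 1)) * ratio ** (1 / 3)
    b = 1 / x + ratio ** (2 / 3) / (n - x + 1)
    return a / (3 * sqrt(b))


n, theta, x = 35, 0.30, 15
exact = binom_upper(x, n, theta)
z_true = NormalDist().inv_cdf(1 - exact)      # the z satisfying Eq. (3.11.2)
print(exact, z_true)
print(z_continuity(x, n, theta), z_freeman_tukey(x, n, theta), z_camp_paulson(x, n, theta))
```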

To approximate the binomial probability that $x_1 \le X \le x_2$, namely,
$B(x_1, N, \theta) - B(x_2 + 1, N, \theta)$, we can in the same way use $z_1 = (x_1 - 1/2 - \mu)/\sigma$
and $z_2 = (x_2 + 1/2 - \mu)/\sigma$ and take as the first approximation

(3.11.5)  $P(x_1 \le X \le x_2) \approx \int_{z_1}^{z_2} \phi(u)\,du$

Closer approximations can be obtained by using equation (2) with (3) or (4)
for $B(x_1, N, \theta)$ and $B(x_2 + 1, N, \theta)$ separately.

3.12  Bernoulli's Law of Large Numbers (Using the Normal Distribution)
We saw in § 3.4 that, from Chebyshev's inequality,

(3.12.1)  $P\left(\left|\frac{X}{N} - \theta\right| \ge \lambda\right) \le (4N\lambda^2)^{-1}$

which means that by making $N$ large enough we can be practically certain that
$X/N$ will differ from $\theta$ by as small an amount as we please. However, if we use the
normal approximation to the cumulative binomial we can achieve the same result
with a much smaller value of $N$.

If $x_1 = N(\theta - \lambda)$ and $x_2 = N(\theta + \lambda)$, $x_1$ and $x_2$ need not be integers when
$\lambda$ is arbitrary. The probability that $|X/N - \theta| < \lambda$, which is the probability
that $X$ lies between $x_1$ and $x_2$, is, for large $N$, close to the probability that $Z$
lies between $z_1$ and $z_2$, where $z_1 = -N\lambda/\sigma$ and $z_2 = N\lambda/\sigma$. This
probability is $\int_{z_1}^{z_2} \phi(u)\,du = 2\Phi(z_2) - 1$, since $z_1 = -z_2$. The required probability
$P(|X/N - \theta| \ge \lambda)$ is therefore $1 - (2\Phi(z_2) - 1) = 2(1 - \Phi(z_2))$. In order to
make this less than some fixed $\varepsilon$ for a given $\lambda$, we have to choose $z_2$ so that
$\Phi(z_2) > 1 - \varepsilon/2$. This provides a lower limit for $N$.

Thus if $\lambda = 0.01$ and $\varepsilon = 0.001$, we find that $z_2 > 3.29$. Since $\sigma^2 = N\theta(1 - \theta)$,
we have then

(3.12.2)  $N^{1/2}\lambda > 3.29\,[\theta(1 - \theta)]^{1/2}$

For all values of $\theta$, $\theta(1 - \theta) \le \frac{1}{4}$, so that this inequality will certainly be
satisfied if $N^{1/2}\lambda > 1.645$, and therefore if $N > 27,060$. This is considerably
better than the bound on $N$ (2,500,000) obtained in § 3.4 by the use of Chebyshev's
inequality.
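
The same calculation can be sketched in code, using `statistics.NormalDist().inv_cdf` in place of the tables. The function name is our own; it takes the worst case $\theta(1 - \theta) = 1/4$, as in the text. Because it uses the unrounded normal deviate (about 3.2905 rather than 3.29) the result differs slightly from 27,060.

```python
from statistics import NormalDist


def normal_trials(lam: float, eps: float) -> float:
    """N from Eq. (3.12.2) with theta(1 - theta) replaced by its maximum 1/4."""
    z2 = NormalDist().inv_cdf(1 - eps / 2)    # smallest z2 with Phi(z2) = 1 - eps/2
    return (z2 / (2 * lam)) ** 2


n_normal = normal_trials(0.01, 0.001)
n_chebyshev = 1 / (4 * 0.001 * 0.01 ** 2)     # bound from Eq. (3.4.2)
print(round(n_normal), round(n_chebyshev))
```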

Example 4  If $\theta = \frac{3}{5}$ and $N = 600$, what is the probability that the relative
frequency of success will differ from $\frac{3}{5}$ by less than 0.01?

Here $\sigma = [N\theta(1 - \theta)]^{1/2} = 12$ and $N\theta = 360$. Also $P(|X/N - 0.6| < 0.01)
= P(354 < X < 366) \approx P(z_1 < Z < z_2)$, where $Z = (X - 360)/12$, $z_1 = -0.458$,
$z_2 = 0.458$ (the limits 354.5 and 365.5 being used as a correction for continuity,
as in § 3.11). This last probability is $2\Phi(0.458) - 1 = 0.353$, which is the
probability required.
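
Example 4 can be checked against the exact binomial sum. The sketch below is our own illustration; it applies the continuity correction of § 3.11, so that the upper limit for $X \le 365$ becomes 365.5.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_between(lo: int, hi: int, n: int, theta: float) -> float:
    """Exact P(lo <= X <= hi) for a binomial variate."""
    return sum(comb(n, k) * theta ** k * (1 - theta) ** (n - k) for k in range(lo, hi + 1))


n, theta = 600, 0.6
sigma = sqrt(n * theta * (1 - theta))             # 12
z2 = (365.5 - n * theta) / sigma                  # about 0.458
approx = 2 * NormalDist().cdf(z2) - 1             # normal approximation
exact = binom_between(355, 365, n, theta)         # 354 < X < 366
print(approx, exact)
```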

3.13  Properties of the Normal Distribution  The probability density for the
standardized normal distribution is

(3.13.1)  $\phi(z) = (2\pi)^{-1/2} e^{-z^2/2}, \qquad -\infty < z < \infty$

This is an even function, $\phi(z) = \phi(-z)$, with a maximum value 0.3989 at
$z = 0$. The quartiles $Q_3$ and $Q_1$ are at $z = \pm 0.6745$, since

(3.13.2)  $\int_0^{0.6745} \phi(z)\,dz = 0.25$


The  probability  is  therefore  0.5  that  z  has  a  value  between  -0.6745  and 
+  0.6745.  There  is  an  even  chance  that  a  normal  variate  lies  between  the  mean 
plus  0.6745  times  the  standard  deviation  and  the  mean  minus  0.6745  times  the 
standard  deviation. 

The moment generating function for the standard normal distribution is

(3.13.3)  $M(h) = \int_{-\infty}^{\infty} e^{hz} \phi(z)\,dz = (2\pi)^{-1/2} \int_{-\infty}^{\infty} e^{hz - (1/2)z^2}\,dz$

If we change the variable of integration from $z$ to $u$, where $u = z - h$, this
becomes

(3.13.4)  $M(h) = (2\pi)^{-1/2} e^{(1/2)h^2} \int_{-\infty}^{\infty} e^{-(1/2)u^2}\,du$

          $= e^{(1/2)h^2}$

          $= 1 + \frac{1}{2}h^2 + \frac{1}{2!}\left(\frac{1}{2}h^2\right)^2 + \cdots$

The moments about the mean are therefore $\mu_2 = 1$, $\mu_4 = 3$, $\mu_6 = 15$, etc. The
odd-order moments are all zero, as is obvious from the symmetry of the
distribution about $z = 0$.

The  cumulant  generating  function  is 

(3.13.5)  $K(h) = \log M(h) = \frac{1}{2}h^2$

The only non-zero cumulant is therefore

(3.13.6)  $\kappa_2 = 1$

which expresses the fact that the variance is unity, as of course it must be for a
standardized distribution. For the non-standardized normal distribution, with
mean $\mu$ and variance $\sigma^2$, we have, by § 2.11,

(3.13.7)  $\kappa_1 = \mu, \qquad \kappa_2 = \sigma^2$

The  great  simplicity  of  the  system  of  cumulants  for  the  normal  distribution  is 
one  reason  for  the  importance  of  this  distribution  in  statistical  theory. 

The normal curve is asymptotic to the $z$-axis as $z \to \pm\infty$. Practically, it
almost touches the axis beyond $z = \pm 4$. Table 3.2 gives the proportion of
area beyond $z = \pm z_0$ for a few selected values of $z_0$, and therefore represents the
probability that a standard normal variate will have a value outside the given
interval. This table will be useful later on in problems of estimation where the
variate concerned may be regarded as approximately normal.

3.14  Probability Graph Paper  The graph of the distribution function $\Phi(z)$
is a roughly S-shaped curve, looking something like the cumulative frequency
polygon of Figure 14b, but of course smooth throughout. If the scale on the
graph paper is properly adjusted, this curve may be straightened. In Figure
20, the data of Table 2.2 are plotted on special probability graph paper [11].
The scale along the axis of $x$ is uniform, but on the axis representing percentage
cumulative frequency the scale is compressed in the middle and extended near the
top and bottom so that the polygon of Figure 14b becomes almost a straight
line. The points marked in the diagram are the values of $100F/N$, or, in this
case, $F/10$, plotted against corresponding values of $x_e$. The fact that these
points apparently lie close to a straight line is good presumptive evidence that the
distribution of the variate $X$ (in the population from which the sample is taken)
is approximately normal. A method of testing this presumption will be
discussed later.

Table 3.2

    z0      P(|z| > z0)        z0        P(|z| > z0)

    1       0.3173             0.6745    0.5
    1.5     0.1336             1.2816    0.2
    2       0.0455             1.6449    0.1
    2.5     0.0124             1.9600    0.05
    3       0.0027             2.3263    0.02
    3.5     0.00046            2.5758    0.01
    4       0.00006            3.2905    0.001
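
Entries of this kind can be regenerated from the normal distribution function. The sketch below uses Python's `statistics.NormalDist`; the two helper names are our own. The first gives the left-hand columns (tail area for a given $z_0$), the second the right-hand columns ($z_0$ for a given tail area).

```python
from statistics import NormalDist

STANDARD = NormalDist()   # mean 0, standard deviation 1


def two_sided_tail(z0: float) -> float:
    """P(|z| > z0) for a standard normal variate."""
    return 2 * (1 - STANDARD.cdf(z0))


def critical_value(prob: float) -> float:
    """The z0 for which P(|z| > z0) equals prob."""
    return STANDARD.inv_cdf(1 - prob / 2)


for z0 in (1, 1.5, 2, 2.5, 3, 3.5, 4):
    print(z0, round(two_sided_tail(z0), 5))
for prob in (0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.001):
    print(prob, round(critical_value(prob), 4))
```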

If a straight line is drawn by eye as evenly as possible between the plotted
points, one can make a quick rough estimate of the median, quartiles, etc. for the
distribution (by noting the values of $x$ corresponding to percentage cumulative
frequencies of 50, 25, 75, etc.). One can also estimate readily the probabilities
that chosen values of $x$ will be exceeded in the population.
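
The effect of probability paper can be imitated numerically: each cumulative proportion is converted to the normal deviate $z$ with $\Phi(z)$ equal to that proportion, and a straight line of $x$ on $z$ is fitted by least squares instead of by eye. The sketch below is only an illustration; the data are hypothetical values generated from a normal distribution with mean 50 and standard deviation 10, not the data of Table 2.2.

```python
from statistics import NormalDist


def normal_fit_from_cumulative(xs, cum_fracs):
    """Least-squares line x = centre + spread * z through the probability-paper points."""
    zs = [NormalDist().inv_cdf(f) for f in cum_fracs]   # deviates for each proportion
    n = len(xs)
    z_bar = sum(zs) / n
    x_bar = sum(xs) / n
    spread = (sum((z - z_bar) * (x - x_bar) for z, x in zip(zs, xs))
              / sum((z - z_bar) ** 2 for z in zs))
    centre = x_bar - spread * z_bar                     # value of x at z = 0 (median)
    return centre, spread


xs = [40, 45, 50, 55, 60]                               # hypothetical end values
cum_fracs = [NormalDist(50, 10).cdf(x) for x in xs]     # hypothetical cumulative proportions
centre, spread = normal_fit_from_cumulative(xs, cum_fracs)
print(centre, spread)
```

The intercept estimates the median and the slope the standard deviation; quartiles then follow as centre ± 0.6745 × spread.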


*  3.15  The Angular Transformation for Binomial Variates  If a variate is
thought to be binomial, transformations such as those given in equations (3.11.3)
and (3.11.4) will aid in the approximation to normality. That is, the quantity
on the right-hand side of each of these equations is approximately a standard
normal variate. However, a different transformation is often used with a
different purpose in mind, namely to make the variance more nearly constant
(independent of $\theta$). The so-called angular transformation which is appropriate
here, and which was suggested by Fisher, is

(3.15.1)  $A = \sin^{-1}(p^{1/2})$

where $p = X/N$, the observed proportion of successes, and $A$ is the angle (in
degrees) whose sine is $p^{1/2}$. A table of values of $A$ for given $p$ may be found
in [12].

Fig. 20  Data of Table 2.2 plotted on probability graph paper

As we have seen, the variance of $p$ is $\theta(1 - \theta)/N$, but it turns out that the
variance of $A$ is approximately $821/N$, whatever $\theta$ may be. The angular
transformation is therefore often advisable in testing experimental results by the
method of analysis of variance, to be discussed later, in which constancy of the
variance in different circumstances is usually assumed. A decidedly
non-rigorous but simple proof of the effect of this transformation on variance goes
as follows:

A small variation $\delta A$ in $A$ is related to the corresponding variation $\delta p$ in $p$ by
the equation

(3.15.2)  $\delta A \approx \frac{dA}{dp}\,\delta p = \frac{180}{\pi} \cdot \frac{1}{2p^{1/2}(1 - p)^{1/2}}\,\delta p$

so that

(3.15.3)  $(\delta A)^2 \approx \left(\frac{90}{\pi}\right)^2 \frac{(\delta p)^2}{p(1 - p)} \approx \frac{821}{p(1 - p)}\,(\delta p)^2$

Now the variance of $p$ may be regarded as the expectation of $(\delta p)^2$ where $\delta p$
is a sampling fluctuation about the mean, and a similar interpretation holds for
the variance of $A$. Therefore

(3.15.4)  $V(A) \approx \frac{821}{p(1 - p)}\,V(p)$

Since the variance of $p$ is approximately $p(1 - p)/N$, we obtain

(3.15.5)  $V(A) \approx \frac{821}{N}$

A more precise argument [13] shows that as $N \to \infty$ the distribution of $A$
does in fact tend to a normal distribution with mean $\sin^{-1} \theta^{1/2}$ and variance
$821/N$.
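
The near-constancy of the variance can be examined exactly for a given $N$ by summing over the binomial distribution. The sketch below is our own illustration with $N = 100$; it compares $N \cdot V(A)$ with 821 and $N \cdot V(p)$ with $\theta(1 - \theta)$ for three values of $\theta$.

```python
from math import asin, comb, degrees, sqrt


def angular_variance(n: int, theta: float) -> float:
    """Exact variance of A = arcsin(sqrt(X/N)) in degrees for a binomial X."""
    probs = [comb(n, x) * theta ** x * (1 - theta) ** (n - x) for x in range(n + 1)]
    angles = [degrees(asin(sqrt(x / n))) for x in range(n + 1)]
    mean = sum(p * a for p, a in zip(probs, angles))
    return sum(p * (a - mean) ** 2 for p, a in zip(probs, angles))


n = 100                                    # hypothetical number of trials
for theta in (0.3, 0.5, 0.7):
    print(theta, n * angular_variance(n, theta), theta * (1 - theta))
```

The second column should stay close to 821 while the third, the variance of $p$ multiplied by $N$, changes with $\theta$.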

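A slight modification of the transformation, namely,

(3.15.6)  $A = \frac{1}{2}\left[\sin^{-1}\left(\frac{X}{N + 1}\right)^{1/2} + \sin^{-1}\left(\frac{X + 1}{N + 1}\right)^{1/2}\right]$

gives a quantity $A$ whose variance is within ±6% of $821/(N + \frac{1}{2})$ for almost all
binomial distributions with $N\theta \ge 1$.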

*  3.16  The   Square  Root  Transformation  for   Poisson   Variates    If  X  is   a 

Poisson  variate  with  mean  /x,  we  know  that  V(X)  —  \x  and  the  skewness  of  X, 
given  by  k:3/k23/2,  is  ju~1/2.  A  transformation  which  serves  to  stabilize  the 
variance  approximately  is 

(3.16.1)  Y=Xl/2 
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By  a  similar  argument  to  that  in  §  3.15,57  «  iZ_1/2  <5Xand(<57)2  «  — -  (SX)2, 
so  that  V(  7)  «  —  F(X)  «  -,  since  X  &  fi  (which  is  the  expectation  of  X). 

More precisely [13], if

(3.16.2)  $Y = \sqrt{X + a}$ for $X \ge -a$, and $Y = 0$ for $X < -a$

the distribution of $Y - \sqrt{\mu + a}$ tends as $\mu \to \infty$ to a normal distribution with
mean 0 and variance $\tfrac14$. If $a = 0$, this means that $E(Y) \to \mu^{1/2}$. Actually, for
large $\mu$ and $a = 0$,

$E(Y) \approx (\mu - \tfrac14)^{1/2}, \qquad V(Y) \approx \tfrac14$

and the skewness of $Y$ is $-1/(2\mu^{1/2})$ approximately, which indicates that the
normality is not greatly improved by the transformation (the skewness is halved
numerically).

Bartlett [14] found that if $a = \tfrac12$, the variance is usually considerably nearer
to $\tfrac14$ for moderate values of $\mu$ than if $a = 0$, and he recommended the use of the
transformation $Y = \sqrt{X + \tfrac12}$. Johnson and Anscombe [15] recommend
$a = \tfrac38$.

Anscombe [16] has pointed out that a better transformation, if we are
interested in normalizing, is

(3.16.3)  $Y = X^{2/3}$

The variance of $Y$ is about $4\mu^{1/3}/9$ and the skewness is of order $\mu^{-3/2}$. The
expectation of $Y$ is $(\mu - \tfrac16)^{2/3}$, so that if we want a normal approximation to the
Poisson cumulative function $P(c, \mu)$, we may write

(3.16.4)  $P(c, \mu) = 1 - \Phi(z)$

where

(3.16.5)  $z \approx \dfrac{(c - \tfrac12)^{2/3} - (\mu - \tfrac16)^{2/3}}{\tfrac23\,\mu^{1/6}}$

The term $c - \tfrac12$ is used instead of $c$ as a correction for continuity, like that
discussed in § 3.11.

Example 5  Let $\mu = 4$, $c = 6$. The true probability that $X \ge 6$ is $P(6, 4)$
$= 0.2149$. The value of $z$ given by Eq. (5) is $z = [(5.5)^{2/3} - (23/6)^{2/3}]/[(2/3)\cdot 4^{1/6}]$
$= 0.7935$ and the corresponding probability is 0.2138, which is a fairly good
approximation to the truth.
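
The arithmetic of Example 5 can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, the exact tail is obtained by summing Poisson terms, and $\Phi$ is evaluated through the error function of the standard library.

```python
import math

def poisson_upper_tail(c, mu):
    """Exact P(X >= c) for a Poisson variate with mean mu."""
    term = math.exp(-mu)          # p(0, mu)
    lower = 0.0
    for x in range(c):
        lower += term             # accumulate p(0), ..., p(c - 1)
        term *= mu / (x + 1)      # p(x + 1) from p(x)
    return 1.0 - lower

def normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_thirds_power_tail(c, mu):
    """z of Eq. (3.16.5) and the approximation 1 - Phi(z) of Eq. (3.16.4)."""
    z = ((c - 0.5) ** (2 / 3) - (mu - 1 / 6) ** (2 / 3)) / ((2 / 3) * mu ** (1 / 6))
    return z, 1.0 - normal_cdf(z)

exact = poisson_upper_tail(6, 4.0)
z, approx = two_thirds_power_tail(6, 4.0)
print(round(exact, 4), round(z, 4), round(approx, 4))
```

The three printed numbers correspond to the true probability, the value of $z$, and the normal approximation quoted in the example.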
PROBLEMS
A.  (§§  3.1-3.3) 

1.  In  a  binomial  distribution  with  0  =  f ,  calculate  the  probability  that  the  number 
of  successes  in  six  trials  is  either  1,  2  or  3. 

2.  If  ten  "good"  coins  are  tossed,  what  is  the  probability  of  (a)  at  least  three  heads, 
(b)  not  more  than  three  heads  ? 

3. Show that the greatest value of $\binom{2n}{x}$ for positive integral values of $x$ occurs
when $x = n$. What does this tell us about the binomial distribution with $\theta = \tfrac12$?

4.  An  antiaircraft  battery  in  England  during  World  War  II  had  on  the  average 
three  out  of  five  successes  in  shooting  down  "flying  bombs"  that  came  within  range. 
What  was  the  chance  that  if  eight  bombs  came  within  range,  not  more  than  two  got 
through  the  barrage  without  being  shot  down? 

5.  A  and  B  play  a  game  in  which  A's  chance  of  winning  is  f .  In  a  series  of  eight
such  games,  supposedly  independent,  what  is  the  chance  that  A  will  win  at  least  six  ? 

6.  If  the  probability  of  success  in  a  single  trial  is  0.01,  how  many  independent  trials 
are  necessary  in  order  to  have  the  probability  of  at  least  one  success  greater  than  £? 
Hint:  Find  n  so  that  1  -  (0.99)n  >  \. 

7.  Prove  the  relation  of  Eq.  (3.3.12)  for  successive  binomial  cumulants. 

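Hint: With $K(h) = N\log(1 - \theta + \theta e^h)$, $\kappa_r = \left[\dfrac{d^r K}{dh^r}\right]_{h=0}$, so that

$\dfrac{d\kappa_r}{d\theta} = N\left[\dfrac{d^r}{dh^r}\left(\dfrac{e^h - 1}{1 - \theta + \theta e^h}\right)\right]_{h=0}$

Also, $\kappa_{r+1} = \left[\dfrac{d^{r+1} K}{dh^{r+1}}\right]_{h=0} = N\left[\dfrac{d^r}{dh^r}\left(\dfrac{\theta e^h}{1 - \theta + \theta e^h}\right)\right]_{h=0}$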

8. In a series of $n$ trials of a binomial distribution the numbers of successes and
failures are $n_1$ and $n_2$. Calculate the covariance of $n_1$ and $n_2$ and the variance of the
difference $n_1/n - n_2/n$. Hint: $C(n_1, n_2) = E(n_1 n_2) - E(n_1)E(n_2)$. For second part,
see § 2.14.

B.  (§§  3.4-3.5) 

1.  If  1,000  trials  are  made  of  an  event  with  probability  of  success  \  in  each  trial, 
find  the  Bernoulli  upper  limit  for  the  probability  that  the  proportion  of  successes  will 
differ  from  \  by  as  much  as  0.05. 

2. How many trials must be made of an event with binomial probability of success
$\tfrac12$ in each trial, in order to be assured (by the Bernoulli law) with probability at least
0.9 that the relative frequency of success will be between 0.48 and 0.52?

3. Suppose that in a Poisson sampling scheme the probability of success on the
$j$th trial is always either 0 or 1, and that in $N$ trials there are $N_1$ cases of $\theta_j = 0$ and $N_2$ of
$\theta_j = 1$. Show that the formula of Eq. (3.5.3) for the variance of the number of successes
reduces, as it should, to zero.

4. If $p_N$ is the proportion of successes in $N$ independent trials, the probability of
success at the $j$th trial being $\theta_j$, prove that $p_N$ converges in probability to $\bar\theta$, where
$\bar\theta = \dfrac{1}{N}\sum \theta_j$. Hint: Show that $P(|p_N - \bar\theta| > A)$ can be made arbitrarily small for any
fixed $A > 0$.

5.  Two  persons  are  picked  at  random  from  a  group  of  five  persons,  consisting  of 
three  men  and  two  women.  Let  X  represent  the  number  of  men  in  the  sample  picked. 
Write  down  the  probabilities  for  the  possible  values  of  X.  Calculate  the  expectation 
and variance of $X$ and so verify the formulas of Eqs. (3.5.10) and (3.5.11).

C.  (§§  3.6-3.9) 

1. A Poisson distribution is such that the probability is the same for $X = 1$ and
for $X = 2$. What is this probability?

2. A liquid culture medium contains on the average $\mu$ bacteria per milliliter. Many
samples are taken, of 1 ml each, and the total number of bacteria in each sample is
counted. Assuming that the distribution is Poisson and that 10% of the samples are
free from bacteria, estimate $\mu$.

3.  If  on  the  average  the  proportion  of  defective  fuses  in  a  large  consignment  is 
0.015,  calculate  the  approximate  probability  that  in  a  box  of  200  fuses  there  will  not 
be  more  than  2  defective. 

4. A seed distributor finds that on the average 5% of his seeds will not germinate.
He puts them up in packages of 100 and guarantees 90% germination. Find an
approximate expression for the probability that a given package will violate the guarantee.

5. Suppose that the number of telephone calls received by an operator in a particular
5-min interval, say from 9:30 a.m. to 9:35 a.m., is a Poisson variate with mean
4. Find the probability that on a future working day the operator will receive in this
interval of time (a) not more than one call, (b) six or more calls.

6.  A  retailer  with  limited  storage  space  finds  that,  on  the  average,  he  sells  two  boxes 
of  parrot  food  per  week.  He  replenishes  his  stock  every  Monday  morning  so  as  to 
start  the  week  with  four  boxes  on  hand.  What  are  the  probabilities  that  (a)  he  sells  his 
entire  stock  in  a  week,  (b)  he  is  unable  to  fill  at  least  one  order?  With  how  many 
boxes  should  he  start  the  week  so  as  to  have  a  probability  at  least  0.99  of  being  able 
to  fill  all  orders?  Hint:  Assume  a  Poisson  distribution  of  sales  with  mean  2,  and  find 
the  probability  of  x  or  more  sales. 

7. Show that if $X$ is a Poisson variate with mean $\mu$, then $E(X^2) = \mu E(X + 1)$.

8. The mean absolute deviation (m.a.d.) about the mean for a variate $X$ is defined
as $E(|X - \mu|)$. Show that for a Poisson variate with $\mu = 1$, the m.a.d. is $2/e$ times the
standard deviation.

9. Prove that the sum of two independent Poisson variates with means $\mu_1$ and $\mu_2$
is Poisson with mean $\mu_1 + \mu_2$. Hint: Use Eqs. (2.12.8) and (3.7.4).

D.  (§§  3.10-3.14) 

1. From the tables in Appendix B.2, write down the values of $\phi(1.15)$, $\phi(-0.64)$,
$\Phi(2.07)$ and $\Phi(-1.63)$.

2.  (a)  Determine  z  so  that  P    </>(u)  du  =  |;  (b)  If  <D(z)  =  0.43,  calculate  z.   Hint: 

Use  linear  interpolation  on  the  tabular  values. 

3. A variate $X$ is distributed normally with mean 12 and standard deviation 2.
Find the probability that $X$ lies between 9.5 and 13.0.

4. A sample of size 1500 is normally distributed with $\mu = 75$ and $\sigma = 10$. Find (a)
the value of $X$ such that the corresponding cumulative frequency ($F$) is 800, (b) the
number of items in the sample with $X < 80$.

5.  The  median  of  a  normal  distribution  is  89.0  and  the  first  quartile  is  75.5.  What 
is  the  standard  deviation? 

6.  An  electric  railway  company  operating  a  subway  uses  thousands  of  light  bulbs 
in  its  underground  stations.  On  the  morning  of  January  1,  1960,  the  company  put  into 
service  5,000  new  bulbs.  Assuming  that  the  distribution  of  length  of  life  for  these 
bulbs  is  normal,  with  a  mean  of  50  days  and  a  standard  deviation  of  19  days,  how  many 
of  them  would  need  to  be  replaced  by  midnight  on  January  31,  1960? 

How  many  by  March  9,  1960?  (Count  January  1  as  a  full  day.) 

7. A collection of human skulls is divided into three classes according as a certain
length-breadth index $X$ is (a) under 75, (b) from 75 to 80, (c) over 80. These are called,
respectively, dolichocephalic (long-headed), mesocephalic (medium) and brachycephalic
(short-headed). Assuming that the distribution of $X$ is normal, and that out of
50 skulls examined the numbers in the three classes are 29, 19, and 2, find the mean and
standard deviation of $X$ for this collection. Hint: Find the values of $z$ for $X = 75$ and
80 from the corresponding values of $\Phi(z)$.

8.  The  mean  height  of  soldiers  in  a  regiment  containing  1,000  men  is  68.22  in.,  with 
a  standard  deviation  of  3.29  in.  If  the  distribution  is  normal,  how  many  men  over  6  ft 
tall  would  you  expect  to  find  in  the  regiment  ? 

9. Verify that the points of inflection of the standard normal curve are at $z = \pm 1$,
and that the tangents to the curve at these points meet the $z$-axis at $z = \pm 2$. Hint: At
the points of inflection the second derivative of $\phi(z)$ vanishes.

10. Use Eq. (3.11.2) to find an approximation to the probability of at least 7
successes in 20 independent trials when the probability of success in each trial is $\tfrac14$.
Also calculate the approximations given by Eqs. (3.11.3) and (3.11.4) and compare
with the true value, 0.2142.

11. Calculate an approximation to the probability that in 1,000 binomial trials,
with probability of success $\tfrac12$ in each trial, the number of successes will be outside the
limits 481 to 519 inclusive. What is the upper bound on this probability given by
Chebyshev's inequality?

12. A normal distribution with mean $\mu$ and variance $\sigma^2$ is truncated at $X = a$ and
all values less than $a$ are discarded. Show that the mean of the truncated distribution
is at $\mu + \sigma\phi(\alpha)/[1 - \Phi(\alpha)]$, where $\alpha = (a - \mu)/\sigma$. Hint: $\int_\alpha^\infty v\,\phi(v)\,dv = -\int_\alpha^\infty d\phi(v) = \phi(\alpha)$.

13. Give an alternative "proof" of the theorem that the limiting form of the standardized
binomial variate, as $N \to \infty$, is the standardized normal variate, by showing that
the c.g.f. of the binomial tends to that of the normal as $N \to \infty$, and assuming that a
distribution is uniquely determined by its c.g.f. Hint: The c.g.f. for the binomial is
$N \log(1 - \theta + \theta e^h)$. For the standardized binomial we must subtract $\mu h/\sigma$ and replace
$e^h$ by $e^{h/\sigma}$, where $\mu = N\theta$ and $\sigma^2 = N\theta(1 - \theta)$ (see § 2.12). Expand the logarithm in
powers of $h$ and show that $K(h) \to h^2/2$ as $N \to \infty$.

14. Use Stirling's approximation (Appendix A.2) to prove that if the probability
of success in a single trial is $\tfrac12$, then in a series of $n$ binomial trials the probability of
exactly $x$ successes is $(2/\pi n)^{1/2} \exp[-2(x - n/2)^2/n]$, neglecting terms of order $1/n$.

E.  (§§3.15-3.16) 

1. Use the method of § 3.15 to show that if the variance of $X$ is approximately
proportional to $X^2$ then the transformation $Y = \log X$ produces a variate with approximately
constant variance.

2. Show that if the variance of $X$ is approximately proportional to $(1 - X^2)^2$ then
a suitable transformation for producing a variate with approximately constant variance
is $Y = \tfrac12 \log[(1 + X)/(1 - X)]$.

3. The following table gives the percentage damage by boll weevils on cotton plants
treated in various ways, there being five replications for each treatment. Use the simple
angular transformation, Eq. (3.15.1), to obtain corresponding values of $A$. Calculate
the estimated variance ($k_2$) for each treatment, both for the original variate $X$ and for
the angular variate $A$. Does the variance appear to be more nearly constant (as between
treatments) after the transformation than before?


Treatments              Replications
                  1     2     3     4     5
     1           18    17    27    34    42
     2           18    14    12    27    42
     3           14    14    17    23    25
     4           10     8    12    26    24
     5           11     9    11    15    22


REFERENCES

[1] Glover, J. W., and Wahr, G. (ed.), Tables of Applied Mathematics in Finance, Insurance, Statistics, Ann Arbor, 1923.
[2] Pearson, E. S., and Hartley, H. O. (ed.), Biometrika Tables for Statisticians, Cambridge Univ. Press, 1954.
[3] Tables of the Cumulative Binomial Distribution, U.S. Army Ordnance Corps, 1952. This gives to 7 decimal places the values of $B(x, N, \theta)$ for $N = 1(1)100$, $\theta = 0.01(0.01)0.5$, and all $x$ concerned. The notation 0.01(0.01)0.5 means that values are tabulated at intervals of 0.01 from 0.01 to 0.5 inclusive.
[4] Harvard Tables of the Cumulative Binomial Distribution, Harvard Univ. Press, 1956. In this 5-place table $N = 1(1)50(2)100(10)200(20)500(50)1000$, and $\theta = 0.01(0.01)0.50$. Values are also given for some other fractional values of $\theta$, namely, 16ths and 12ths.
[5] See, for example, Ford, L. R., Differential Equations, McGraw-Hill, 1955, p. 155.
[6] Molina, E. M., Poisson's Exponential Binomial Limit, Van Nostrand, 1947. Table I gives 6- or 7-place values of $p(x, \mu)$ for $\mu = 0.001(0.001)0.01(0.01)0.30(0.1)15.0(1)100$ and for appropriate $x$. Table II gives $P(c, \mu)$ for the same values of $\mu$ and for appropriate $c$.
     Kitagawa, T., Tables of Poisson Distribution, Baifukan, Tokyo, 1952. Gives 7- or 8-decimal values of $p(x, \mu)$ for $\mu = 0.001(0.001)1.000(0.01)10.00$.
[7] Kelley, T. L., Statistical Tables, Harvard Univ. Press, 1948. Gives $x$ and $\phi(x)$ to 8 places for $\Phi(x) = 0.5(0.0001)1$.
[8] Federal Government Work Projects Administration, Tables of Probability Functions, Vol. II, National Bureau of Standards, 1942. Gives $\phi(x)$ and $2\Phi(x) - 1$ to 15 places for $x = 0(0.0001)1(0.001)7.8$. An auxiliary table continues to $x = 10$ (to 7 significant figures).
[9] Raff, M. S., "On Approximating the Point Binomial," J. Amer. Stat. Ass., 51, 1956, pp. 293-303.
[10] Freeman, M. F., and Tukey, J. W., "Transformations Related to the Angular and the Square Root," Ann. Math. Stat., 21, 1950, pp. 607-611.
[11] Probability graph paper (both ordinary and logarithmic) is obtainable from the Codex Book Co., Inc., Norwood, Mass.
[12] Fisher, R. A., and Yates, F., Statistical Tables for Biological, Agricultural and Medical Research, Oliver and Boyd, 4th ed., 1953.
[13] Curtiss, J. H., "On Transformations Used in Analysis of Variance," Ann. Math. Stat., 14, 1943, pp. 107-122.
[14] Bartlett, M. S., "The Use of Transformations," Biometrics, 3, 1947, pp. 39-52.
[15] Anscombe, F. J., "The Transformation of Poisson, Binomial and Negative-Binomial Data," Biometrika, 35, 1948, pp. 246-254.
[16] Anscombe, F. J., Discussion following a paper by Hotelling, J. Roy. Stat. Soc., B15, 1953, p. 229.


Chapter  4 

OTHER  PROBABILITY  DISTRIBUTIONS 

4.1  Reasons  for  Studying  Probability  Distributions  The  main  purpose  of 
studying  distributions,  such  as  those  in  this  chapter  and  in  Chapter  3,  is  to  be 
able to draw inferences about populations which we can sample. An empirical
sampling  distribution  will  be  more  or  less  irregular,  but  its  form  may  suggest  that 
the  population  distribution  is  closely  normal  or  Poisson  or  of  some  other 
well-known mathematical type. In the next chapter we shall discuss the procedures
for finding the parameters of a theoretical curve to make it fit the observed
distribution  as  closely  as  possible.  Once  having  done  this,  we  can  proceed  to 
make  mathematical  deductions  about  the  population  and  perhaps  test  these  by 
further  observations. 

The  most  important  reason,  however,  for  studying  probability  distributions 
is  their  use  in  testing  statistical  hypotheses  and  assessing  the  significance  of 
experimental  results.  The  practical  statistician  calculates  certain  statistics  from 
his  observational  data  and  then  uses  prepared  tables  in  order  to  decide  whether 
or not to accept a particular hypothesis or to judge an observed result as significant.
The preparation of these tables requires an exact knowledge of the
distribution  of  the  statistic  concerned,  in  random  samples  from  a  population  of 
known  type.  When  these  distributions  have  been  found,  the  necessary  tables 
can  be  calculated,  nowadays  usually  with  the  help  of  electronic  digital  computers. 
For  reasons  of  mathematical  convenience,  the  population  is  generally  assumed 
to  be  normal.  We  shall  come  across  later  in  this  book  several  examples  of  the 
distribution  of  statistics  derived  from  samples  taken  from  normal  populations, 
and  we  shall  find  that  these  distributions  are  closely  related  to  one  or  other  of  the 
distributions  studied  in  the  present  chapter. 

4.2 The Rectangular (Uniform) Distribution  This has already been mentioned
in Example 4, § 2.10. The continuous variate $X$ has the probability
density $f(x) = (\beta - \alpha)^{-1}$, $\alpha < x < \beta$, and $f(x) = 0$ for $x > \beta$ and $x < \alpha$. The
density function is discontinuous at $x = \alpha$ and $x = \beta$ (see Figure 21). Moments
of all orders may easily be calculated.

An interesting property of continuous variates is that if $F(x)$ is the distribution
function of $X$, and if we make the transformation

(4.2.1)  $Y = F(X)$

then $Y$ has a uniform distribution, with $f(y) = 1$, on the interval (0, 1).

Since any distribution function has a range from 0 to 1, the values of $Y$ must
obviously be confined to the interval 0 to 1. Let $G(y)$ be the distribution function
of $Y$, and let $a$ and $b$ be values of $Y$ corresponding to values $a'$ and $b'$ of $X$.
Then

$a = F(a'), \qquad b = F(b')$

The probability that $Y$ lies between $a$ and $b$ is the same as the probability
that $X$ lies between $a'$ and $b'$, which is $F(b') - F(a')$. Therefore,

(4.2.2)  $G(b) - G(a) = F(b') - F(a') = b - a$

where $0 < a < b < 1$.


Fig. 21  Rectangular distribution

If we replace $a$ by $y$ and $b$ by $y + \Delta y$, this becomes $\dfrac{G(y + \Delta y) - G(y)}{\Delta y} = 1$, or,
in the limit as $\Delta y \to 0$,

(4.2.3)  $\dfrac{dG(y)}{dy} = 1, \qquad 0 < y < 1$


The derivative of $G(y)$ is the density function of $Y$, namely, $g(y)$, so that

(4.2.4)  $g(y) = 1, \qquad 0 < y < 1$

which  shows  that  the  distribution  of  Y  is  rectangular.  The  transformation 
expressed  by  Eq.  (1)  is  called  the  probability  transformation.  It  is  sometimes 
useful,  in  proving  general  theorems  about  continuous  distributions,  to  be  able  to 
transform  them  to  so  simple  a  distribution,  mathematically  speaking,  as  the 
rectangular  one. 
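
The probability transformation is easy to illustrate by simulation. The sketch below is an illustration only: the normal population (mean 12, standard deviation 2), the seed, and the sample size are arbitrary assumptions. It draws normal values, applies $Y = F(X)$, and checks that $Y$ behaves like a rectangular variate on (0, 1), whose mean is $\tfrac12$ and whose variance is $\tfrac1{12}$.

```python
import math
import random

def normal_cdf(x, mean=0.0, sd=1.0):
    """Distribution function F(x) of a normal variate."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

rng = random.Random(12345)                        # fixed seed for reproducibility
xs = [rng.gauss(12.0, 2.0) for _ in range(20000)]
ys = [normal_cdf(x, 12.0, 2.0) for x in xs]       # Y = F(X)

mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)

# Relative frequency of Y in each tenth of (0, 1); each should be near 0.1.
counts = [0] * 10
for y in ys:
    counts[min(int(y * 10), 9)] += 1

print(round(mean_y, 3), round(var_y, 4))
print([round(c / len(ys), 3) for c in counts])
```

Any other continuous distribution could be substituted for the normal, provided its distribution function is used in place of `normal_cdf`.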

4.3 Distribution Function of a Transformed Variate  If $u(x)$ is a given
function of $x$, and if $x$ is a value assumed by a random variable $X$, then $u$ may
be regarded as a value of a new random variable $U$. The distribution function
of $U$ is

(4.3.1)  $G(u) = P(U \le u) = \int_{u(x) \le u} f(x)\,dx$

where $f(x)$ is the probability density for $X$ and the integral is taken over all
values of $x$ such that $u(x) \le u$. The density function $g(u)$ is found by differentiating
$G(u)$ with respect to $u$. If the variate $X$ is discrete, the integral in Eq. (1) must be
replaced by a sum.

Example 1  If $f(x) = 1$, $0 < x < 1$, and $u(x) = -2 \log x$, then $u(x) \le u$
if and only if $x \ge e^{-u/2}$. Therefore,

$G(u) = \int_{e^{-u/2}}^{1} dx = 1 - e^{-u/2}$

and

(4.3.2)  $g(u) = \tfrac12 e^{-u/2}, \qquad 0 < u < \infty$

This distribution is illustrated in Figure 22.


Fig. 22  Exponential distribution


Example 2  Suppose $f(x) = (2/9)(x + 1)$, $-1 < x < 2$, and $u(x) = x^2$.
The range of $u$ is $0 < u < 4$, but in the interval from 0 to 1 there are two values
of $x$ corresponding to any given $u$ (e.g., $u = \tfrac14$ for $x = \tfrac12$ or $x = -\tfrac12$). Since
$x = \pm\sqrt{u}$ and runs from $-1$ to 2, the interval of $x$ corresponding to $u(x) \le u$
is from $-\sqrt{u}$ to $\sqrt{u}$ as long as $u < 1$ but only from $-1$ to $\sqrt{u}$ when $u > 1$.
Therefore,

$G(u) = \dfrac{2}{9}\int_{-\sqrt{u}}^{\sqrt{u}} (x + 1)\,dx = \dfrac{4}{9}\sqrt{u}, \qquad 0 < u < 1$

$G(u) = \dfrac{2}{9}\int_{-1}^{\sqrt{u}} (x + 1)\,dx = \dfrac{1}{9}(u + 2\sqrt{u} + 1), \qquad 1 < u < 4$

and

(4.3.3)  $g(u) = \dfrac{2u^{-1/2}}{9}, \quad 0 < u < 1; \qquad g(u) = \dfrac{1 + u^{-1/2}}{9}, \quad 1 < u < 4$
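
The two-branch distribution function of Example 2 can be checked against a direct numerical evaluation of Eq. (4.3.1). The sketch below is illustrative only: the function names and the midpoint rule with a fixed number of steps are assumptions, not part of the original example.

```python
def f(x):
    """Density of X in Example 2."""
    return (2.0 / 9.0) * (x + 1.0) if -1.0 < x < 2.0 else 0.0

def G_closed(u):
    """Closed form of G(u) on the two branches 0 < u < 1 and 1 <= u < 4."""
    if u <= 0.0:
        return 0.0
    if u < 1.0:
        return (4.0 / 9.0) * u ** 0.5
    if u < 4.0:
        return (u + 2.0 * u ** 0.5 + 1.0) / 9.0
    return 1.0

def G_numeric(u, steps=30000):
    """Integrate f over the set {x : x**2 <= u} by the midpoint rule."""
    dx = 3.0 / steps
    total = 0.0
    for i in range(steps):
        x = -1.0 + (i + 0.5) * dx
        if x * x <= u:
            total += f(x) * dx
    return total

for u in (0.25, 0.81, 1.0, 2.5, 3.9):
    print(u, round(G_closed(u), 4), round(G_numeric(u), 4))
```

The two columns should agree to about three decimal places, and the closed form is continuous at $u = 1$, where both branches give $4/9$.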


Example 3  If $f(x) = \phi(x) = (2\pi)^{-1/2} e^{-x^2/2}$ (so that $X$ is a standard normal
variate), and if $u(x) = \tfrac12 x^2$, the range of $u$ is from 0 to $\infty$ and its domain is from
$-\infty$ to $\infty$. Therefore,

$G(u) = \int_{-\sqrt{2u}}^{\sqrt{2u}} \phi(x)\,dx = 2\Phi(\sqrt{2u}) - 1$

and

(4.3.4)  $g(u) = \dfrac{dG(u)}{du} = \pi^{-1/2} u^{-1/2} e^{-u}$

(see Appendix A.9 on differentiation under the sign of integration).

This  distribution  is  a  special  case  of  the  gamma  distribution  discussed  in  the 
next  section. 


4.4 The Gamma Distribution  The distribution with density function

(4.4.1)  $f(x) = \dfrac{e^{-x} x^{\alpha - 1}}{\Gamma(\alpha)}, \qquad 0 < x < \infty$

where $\alpha > 0$, is called the gamma distribution (see Appendix A.5). The gamma
function, $\Gamma(\alpha)$, is a kind of generalized factorial with the basic property $\Gamma(\alpha + 1)$
$= \alpha\Gamma(\alpha)$. Since $\Gamma(\tfrac12) = \pi^{1/2}$ (see Appendix A.7), the gamma distribution with
parameter $\alpha = \tfrac12$ is the one obtained in Example 3 above, given by Eq. (4.3.4).
The form of the distribution, for a few values of $\alpha$, is shown in Figure 23.

Fig. 23  Gamma distribution

The m.g.f. is

(4.4.2)  $M(h) = \int_0^\infty e^{hx} f(x)\,dx = \dfrac{1}{\Gamma(\alpha)} \int_0^\infty e^{-(1-h)x} x^{\alpha - 1}\,dx$

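On making the substitution $u = (1 - h)x$, and supposing that $0 < h < 1$,
we obtain

(4.4.3)  $M(h) = \dfrac{1}{\Gamma(\alpha)} \int_0^\infty e^{-u} u^{\alpha - 1} (1 - h)^{-\alpha}\,du = (1 - h)^{-\alpha}$

by the definition of $\Gamma(\alpha)$ in Eq. (A.5.1).
The c.g.f. is therefore

(4.4.4)  $K(h) = -\alpha \log(1 - h) = \alpha\left(h + \dfrac{h^2}{2} + \dfrac{h^3}{3} + \cdots\right)$

so that the first few cumulants are

(4.4.5)  $\kappa_1 = \alpha, \quad \kappa_2 = \alpha, \quad \kappa_3 = 2\alpha, \quad \kappa_4 = 6\alpha, \ldots$

and in general $\kappa_r = \alpha\Gamma(r)$. The skewness is $\kappa_3/\kappa_2^{3/2} = 2\alpha^{-1/2}$ and the kurtosis is
$\kappa_4/\kappa_2^2 = 6\alpha^{-1}$.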

The gamma distribution has therefore a single parameter which is at the
same time the mean and the variance. As the parameter increases, the distribution
becomes more nearly symmetrical.

A somewhat more general two-parameter distribution, with density function

(4.4.6)  $f(x) = \dfrac{e^{-x/\beta} x^{\alpha - 1}}{\beta^{\alpha}\,\Gamma(\alpha)}, \qquad 0 < x < \infty$

where $\alpha > 0$, $\beta > 0$, is also called a gamma distribution. The $r$th cumulant is
$\kappa_r = \alpha\beta^r\Gamma(r)$. Equation (1) corresponds to the special case $\beta = 1$.
The distribution function for the gamma variate of Eq. (6) is

(4.4.7)  $F(x) = \int_0^x f(u)\,du = \dfrac{1}{\Gamma(\alpha)} \int_0^{x/\beta} e^{-v} v^{\alpha - 1}\,dv$

where we have made the substitution $u = \beta v$. The function of $y$ and $\alpha$, defined by

(4.4.8)  $\Gamma_y(\alpha) = \int_0^y e^{-v} v^{\alpha - 1}\,dv$

is called the incomplete gamma function, and has been extensively tabulated [1].
Pearson's tables actually give the ratio of the incomplete to the complete gamma
function, namely,

(4.4.9)  $I(u, \alpha - 1) = \dfrac{\Gamma_y(\alpha)}{\Gamma(\alpha)}$

with $u = y\alpha^{-1/2}$.

Thus, to find the value of $F(x)$ in Eq. (7) for any given $x$, we should look up
in the tables the value of $I(x\beta^{-1}\alpha^{-1/2}, \alpha - 1)$.

It may be noted that the Poisson cumulative probability $P(c, \mu)$ (see § 3.8) is
expressible in terms of the incomplete gamma function. In fact (see Problem
B-11),

(4.4.10)  $P(c, \mu) = \dfrac{\Gamma_\mu(c)}{\Gamma(c)} = I(u, c - 1)$

with $u = \mu c^{-1/2}$.
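
The relation between the Poisson upper tail and the incomplete gamma function can be verified numerically. The following sketch is an illustration: the function names are hypothetical, the incomplete gamma integral is evaluated by Simpson's rule (an assumed choice of quadrature, adequate for $\alpha \ge 1$), and the Poisson tail is summed directly.

```python
import math

def incomplete_gamma_ratio(y, alpha, steps=2000):
    """Gamma_y(alpha) / Gamma(alpha) by Simpson's rule (alpha >= 1, steps even)."""
    h = y / steps

    def integrand(v):
        return math.exp(-v) * v ** (alpha - 1)

    total = integrand(0.0) + integrand(y)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0 / math.gamma(alpha)

def poisson_upper_tail(c, mu):
    """P(X >= c) for a Poisson variate with mean mu, by direct summation."""
    return 1.0 - sum(math.exp(-mu) * mu ** x / math.factorial(x) for x in range(c))

# The case c = 6, mu = 4 is the one used in Example 5 of § 3.16.
print(round(incomplete_gamma_ratio(4.0, 6), 6), round(poisson_upper_tail(6, 4.0), 6))
```

Both expressions should give the same value, about 0.2149 for $c = 6$, $\mu = 4$.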

It was shown in § 2.12 that in order to find the cumulant generating function
of a sum of independent variates we simply have to sum the individual c.g.f.'s.
The sum of $n$ independent gamma variates of the type described by Eq. (1), with
parameters $\alpha_1, \alpha_2, \ldots, \alpha_n$, has therefore, by Eq. (4), the c.g.f.

$K(h) = -\left(\sum_{i=1}^{n} \alpha_i\right) \log(1 - h)$

and this is the c.g.f. of a gamma variate with parameter $\sum \alpha_i$. On the assumption
that a distribution is completely determined by its c.g.f., this shows that the
sum of $n$ independent gamma variates is a gamma variate. The assumption in
question is justified for the distributions likely to occur in statistical theory.
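
The additive property can be illustrated by simulation. In the sketch below the parameters 0.5, 1.5 and 2.0, the seed and the sample size are arbitrary assumptions. It uses `random.gammavariate(alpha, beta)` from the Python standard library, with scale `beta = 1` so that the variates are of the one-parameter type of Eq. (4.4.1). The sum should then behave like a gamma variate with parameter 4, whose mean and variance are both 4.

```python
import random

rng = random.Random(2024)          # fixed seed for reproducibility
alphas = [0.5, 1.5, 2.0]           # illustrative parameters; their sum is 4
n = 50000

# Each observation is a sum of independent gamma variates with scale 1.
sums = [sum(rng.gammavariate(a, 1.0) for a in alphas) for _ in range(n)]

mean = sum(sums) / n
var = sum((s - mean) ** 2 for s in sums) / n
print(round(mean, 3), round(var, 3))   # both should be close to sum(alphas)
```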

4.5 The Beta Distributions  The two-parameter distribution with density
function

(4.5.1)  $f(x) = \dfrac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}, \qquad 0 < x < 1$

where $\alpha > 0$, $\beta > 0$, and $B(\alpha, \beta)$ is the beta function (see Appendix A.6), is
called the beta distribution. The somewhat similar type of distribution with
density function

(4.5.2)  $g(x) = \dfrac{x^{\alpha - 1}(1 + x)^{-\alpha - \beta}}{B(\alpha, \beta)}, \qquad 0 < x < \infty$

where $\alpha > 0$, $\beta > 0$, may also be called a beta distribution, and will be referred
to as the beta-prime distribution.

Fig. 24  Beta and beta-prime distributions

The principal property of the beta function,

(4.5.3)  $B(\alpha, \beta) = \dfrac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$

is proved in the Appendix. The general shape of the distributions given by
Eqs. (1) and (2) is illustrated in Figure 24, for $\alpha = 4$, $\beta = 3$. The curve of $f(x)$
is tangential to the $x$-axis at $x = 0$ and $x = 1$ if $\alpha$ and $\beta$ are both greater than 2.
The curve of $g(x)$ is tangential at $x = 0$ if $\alpha > 2$.

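The $r$th moment about zero of the beta distribution is

(4.5.4)  $\mu'_r = \int_0^1 x^r f(x)\,dx = \dfrac{B(\alpha + r, \beta)}{B(\alpha, \beta)} = \dfrac{\Gamma(\alpha + r)}{\Gamma(\alpha)} \cdot \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha + \beta + r)}$

$\qquad = \dfrac{(\alpha + r - 1)(\alpha + r - 2) \cdots \alpha}{(\alpha + \beta + r - 1)(\alpha + \beta + r - 2) \cdots (\alpha + \beta)}$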
and similarly for the beta-prime distribution

(4.5.5)  $\mu'_r = \int_0^\infty x^r g(x)\,dx = \dfrac{B(\alpha + r, \beta - r)}{B(\alpha, \beta)} = \dfrac{(\alpha + r - 1)(\alpha + r - 2) \cdots \alpha}{(\beta - 1)(\beta - 2) \cdots (\beta - r)}$

if $\beta > r$. The means of these distributions are therefore at $\alpha/(\alpha + \beta)$ and $\alpha/(\beta - 1)$
respectively. The cumulants, as far as they exist, may be calculated from Eqs. (4)
and (5) and the relations given in §§ 2.9 and 2.12.
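
The product forms of Eqs. (4.5.4) and (4.5.5) can be compared with the ratios of beta functions from which they come. The sketch below is only a check under stated assumptions: the function names are hypothetical, and the beta function is computed from gamma functions as in Eq. (4.5.3), using `math.gamma`.

```python
import math

def beta_fn(a, b):
    """Complete beta function B(a, b) via Eq. (4.5.3)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_moment(r, a, b):
    """r-th moment about zero of the beta distribution, product form of Eq. (4.5.4)."""
    num = den = 1.0
    for k in range(r):
        num *= a + k          # a (a + 1) ... (a + r - 1)
        den *= a + b + k      # (a + b) ... (a + b + r - 1)
    return num / den

def beta_prime_moment(r, a, b):
    """r-th moment about zero of the beta-prime distribution, Eq. (4.5.5); needs b > r."""
    if b <= r:
        raise ValueError("the r-th moment exists only if beta > r")
    num = den = 1.0
    for k in range(r):
        num *= a + k          # a (a + 1) ... (a + r - 1)
        den *= b - 1 - k      # (b - 1)(b - 2) ... (b - r)
    return num / den

a, b = 4.0, 3.0               # the parameter values used for Figure 24
print(beta_moment(1, a, b), a / (a + b))
print(beta_moment(2, a, b), beta_fn(a + 2, b) / beta_fn(a, b))
print(beta_prime_moment(1, a, b), a / (b - 1))
print(beta_prime_moment(2, a, b), beta_fn(a + 2, b - 2) / beta_fn(a, b))
```

Each printed pair should agree apart from rounding error.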
The distribution function for the beta variate is

(4.5.6)  $F(x) = [B(\alpha, \beta)]^{-1} \int_0^x u^{\alpha - 1}(1 - u)^{\beta - 1}\,du$

for $0 < x < 1$. This integral is called the incomplete beta function, $B_x(\alpha, \beta)$,
and has been tabulated by Karl Pearson and his associates [2]. The tables give
the ratio of the incomplete to the complete beta function, namely,

(4.5.7)  $I_x(\alpha, \beta) = \dfrac{B_x(\alpha, \beta)}{B(\alpha, \beta)}$

and so the value of $F(x)$ in Eq. (6) can be read directly from these tables.
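For the beta-prime distribution,

(4.5.8)  $G(x) = [B(\alpha, \beta)]^{-1} \int_0^x u^{\alpha - 1}(1 + u)^{-\alpha - \beta}\,du$

On putting $1 + x = y^{-1}$ and $1 + u = v^{-1}$, this becomes

(4.5.9)  $G(x) = [B(\alpha, \beta)]^{-1} \int_y^1 v^{\beta - 1}(1 - v)^{\alpha - 1}\,dv = [B(\alpha, \beta)]^{-1} \int_0^{1 - y} (1 - w)^{\beta - 1} w^{\alpha - 1}\,dw$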


4.6 The Chi-Square Distribution  Let $X_1, X_2, \ldots, X_n$ be $n$ independent normal
variates with means $\mu_1, \ldots, \mu_n$ and variances $\sigma_1^2, \ldots, \sigma_n^2$. Let the standardized
variate corresponding to $X_i$ be

(4.6.1)  $Z_i = \dfrac{X_i - \mu_i}{\sigma_i}$

Then, as shown in Example 3 of § 4.3, the variable $\tfrac12 Z_i^2$ has the gamma
distribution with parameter $\alpha = \tfrac12$. We have also seen in § 4.4 that $\sum_{i=1}^{n} (\tfrac12 Z_i^2)$
is a gamma variate with parameter $\sum \alpha_i$, in this case $n/2$. If then we denote
$\sum Z_i^2$ by $\chi^2$, the variate $\chi^2/2$ has the density function

(4.6.2)  $f\left(\dfrac{\chi^2}{2}\right) = \dfrac{e^{-\chi^2/2}\,(\chi^2/2)^{(n/2) - 1}}{\Gamma(n/2)}$

The density function for $\chi^2$ itself is $g(\chi^2)$, where

(4.6.3)  $g(\chi^2)\,d(\chi^2) = f\left(\dfrac{\chi^2}{2}\right) d\left(\dfrac{\chi^2}{2}\right)$

so that

(4.6.4)  $g(\chi^2) = \dfrac{(\chi^2/2)^{(n/2) - 1}\,e^{-\chi^2/2}}{2\Gamma(n/2)}$


A distribution with this density function is called the chi-square ($\chi^2$) distribution.
The number $n$ is called the number of degrees of freedom. It is the
number of independent normal variates whose squares are added to produce $\chi^2$.

The  chi-square  distribution  is  an  important  one  in  statistical  theory,  being 
much  used  for  testing  the  goodness  of  fit  of  a  theoretical  curve  to  an  empirical 
distribution  and  for  testing  certain  types  of  statistical  hypotheses.  Examples  of 
these uses will be given later. Meanwhile we list a few properties of this distribution.

The shape of the curve of $g(\chi^2)$, plotted against $\chi^2$, depends on the value of
$n$. The curves for different $n$ look like the gamma distributions of Figure 23.
Since the $r$th cumulant for $\chi^2/2$ is $\kappa_r = (n/2)\Gamma(r)$, the $r$th cumulant for $\chi^2$ is
$2^r \kappa_r = 2^{r-1}(r - 1)!\,n$.

The expectation, variance and skewness are therefore given by

(4.6.5)  $\kappa_1 = E(\chi^2) = n, \qquad \kappa_2 = V(\chi^2) = 2n, \qquad \gamma_1 = \dfrac{\kappa_3}{\kappa_2^{3/2}} = 2\left(\dfrac{2}{n}\right)^{1/2}$


The distribution function for $\chi^2$ is

(4.6.6)  $G(u) = \int_0^u g(\chi^2)\,d\chi^2$


In  the  statistical  applications,  we  are  usually  interested  in  the  area  of  the  chi- 
square  curve  to  the  right  of  a  particular  value  u  (the  shaded  area  in  Figure  25). 


g(X2) 


1-G(u) 


X  — *~ 
Fig.  25    Chi-square  distribution 

This  is  equal  to  1  —  G(u).  The  table  in  Appendix  B  gives  values  of  u  correspond- 
ing to  selected  values  of  1  —  G(u)  for  all  n  from  1  to  30.  More  extensive  tables 
may  be  found  in  [3]. 
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The  function  G(u)  is  converted  by  the  transformation  x2  =  2>v  into  an  in- 
complete gamma  function : 

dv 

(4.6.7)  GO 


Cull 

Ho vi 


T(n/2) 
_ru/2(n/2) 

T(nl2) 
=  I{u(2ny1/2Anl2)-l} 

by  Eq.  (4.4.9),  so  that  the  tables  [1]  may  also  be  used  to  find  G(u). 
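
Where tables are not at hand, $G(u)$ can be evaluated directly from the incomplete gamma ratio in (4.6.7). The sketch below is only an illustration: it uses the standard power series for $\Gamma_x(a)/\Gamma(a)$, and the function names and tolerance are our own choices rather than anything given in the text.

```python
import math

def regularized_lower_gamma(a, x, tol=1e-14, max_terms=10_000):
    """Gamma_x(a) / Gamma(a) by its power series (assumes a > 0 and x >= 0)."""
    if x == 0.0:
        return 0.0
    term = 1.0 / a          # k = 0 term of sum x^k / (a (a+1) ... (a+k))
    total = term
    k = 0
    while abs(term) > tol * abs(total) and k < max_terms:
        k += 1
        term *= x / (a + k)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def chi2_cdf(u, n):
    """G(u) of Eq. (4.6.7): with chi^2 = 2v, G(u) = Gamma_{u/2}(n/2) / Gamma(n/2)."""
    return regularized_lower_gamma(n / 2.0, u / 2.0)

# Area to the right of the tabulated value for n = 30 quoted in this section.
upper_tail = 1.0 - chi2_cdf(43.773, 30)
```

For $n = 30$ and $u = 43.773$ the upper-tail area should come out very close to 0.05.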

For large $n$, it was shown by Fisher that $(2\chi^2)^{1/2} - (2n-1)^{1/2}$ is approximately
a standard normal variate, so that if we need values beyond the scope
of the table in the Appendix we can put

(4.6.8)  $(2\chi^2)^{1/2} \approx z + (2n-1)^{1/2}$

where $z$ is given by $\Phi(z) = G(\chi^2)$.

Thus, suppose in Figure 25 that $n = 30$ and the shaded area is 0.05. The
value of $u$ given by the table of $\chi^2$ is 43.773. The corresponding $z$, for $\Phi(z) = 0.95$,
is 1.645, so that

$(2\chi^2)^{1/2} \approx 1.645 + (59)^{1/2} = 9.326$

which gives $\chi^2 = 43.49$.

A still better approximation is that of Wilson and Hilferty [4], namely,

(4.6.9)  $\left(\dfrac{\chi^2}{n}\right)^{1/3} = 1 - \dfrac{2}{9n} + z\left(\dfrac{2}{9n}\right)^{1/2}$

For the case $n = 30$, $z = 1.645$, this gives $(\chi^2/30)^{1/3} = 1 - 1/135 + 1.645/(135)^{1/2}$
$= 1.1341$, from which $\chi^2 = 43.76$. This is very close to the true value.
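
The two approximations are easily compared by machine. The following sketch (our own illustration, with hypothetical function names) evaluates Eqs. (4.6.8) and (4.6.9) for $n = 30$, taking $z$ from the standard library's normal distribution rather than from a table.

```python
import math
from statistics import NormalDist

def fisher_chi2(n, z):
    """Eq. (4.6.8): (2 chi^2)^(1/2) is roughly z + (2n - 1)^(1/2)."""
    return 0.5 * (z + math.sqrt(2.0 * n - 1.0)) ** 2

def wilson_hilferty_chi2(n, z):
    """Eq. (4.6.9): (chi^2 / n)^(1/3) is roughly 1 - 2/(9n) + z (2/(9n))^(1/2)."""
    c = 2.0 / (9.0 * n)
    return n * (1.0 - c + z * math.sqrt(c)) ** 3

n = 30
z = NormalDist().inv_cdf(0.95)      # about 1.645
fisher_value = fisher_chi2(n, z)
wh_value = wilson_hilferty_chi2(n, z)
tabulated = 43.773                  # value quoted in the text for n = 30
```

The Wilson and Hilferty value should lie appreciably closer to the tabulated 43.773 than the Fisher value does, in line with the worked figures above.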

*  4.7  Theorems  on  the  Chi-Square  Distribution  The  following  theorems  are 
sometimes  useful  in  establishing  the  distributions  of  particular  statistics.  The 
proofs  are  either  sketched  briefly  or  are  omitted  altogether. 

Theorem 4.1  If $Y_i$ ($i = 1, 2, \ldots, n$) is one of a set of orthogonal linear
functions (see Appendix A.10) of the independent variates $X_j$ ($j = 1, 2, \ldots, n$), and
if the $X_j$ are normal with mean 0 and variance 1, then the distribution of $\sum_i Y_i^2$ is
chi-square with $n$ degrees of freedom.

We first note that the distribution of any one of the $Y_i$ is normal. This
follows from Eq. (2.11.8) and the assumption that the distribution is determined
by its cumulant generating function. The further assumption that the different
$Y_i$ are orthogonal to each other implies that they are independent and that

(4.7.1)  $\sum_i Y_i^2 = \sum_j X_j^2$

and, since $\sum_j X_j^2$ is a chi-square variate, by § 4.6, so is $\sum_i Y_i^2$.

Examples of orthogonal linear transformations are

(4.7.2)  $Y_1 = (2)^{-1/2}(X_1 + X_2)$
         $Y_2 = (2)^{-1/2}(X_2 - X_1)$

and

(4.7.3)  $Y_1 = 2^{-1}(X_1 + X_2 + X_3 + X_4)$
         $Y_2 = 2^{-1/2}(X_1 - X_2)$
         $Y_3 = 6^{-1/2}(X_1 + X_2 - 2X_3)$
         $Y_4 = 12^{-1/2}(X_1 + X_2 + X_3 - 3X_4)$

For each $Y_i$ the sum of the squares of the coefficients of the $X_j$ is 1, and for any
two different $Y$'s the sum of the products of the coefficients, pair by pair, is 0.
This is the distinguishing characteristic of an orthogonal linear transformation.
The fact that the sum of the squares of the coefficients of $Y_i$ is 1 shows that
the variance of $Y_i$ is 1 (the same as the variance of the $X_j$). Also the expectation
of $Y_i$ is obviously 0, so that the $Y_i$ are standard normal variates. Each $Y_i^2$ is
therefore a chi-square variate with one degree of freedom. Note that $Y_1^2$
$= (\sum_j X_j)^2/n = n\bar{X}^2$.
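
The two defining properties of an orthogonal transformation can be verified mechanically for the coefficients in (4.7.3). The sketch below is an illustration only; the sample values of $X_1, \ldots, X_4$ are arbitrary numbers chosen by us.

```python
import math

# Rows hold the coefficients of X_1..X_4 in Y_1..Y_4 of Eq. (4.7.3).
A = [
    [c / 2.0 for c in (1, 1, 1, 1)],
    [c / math.sqrt(2.0) for c in (1, -1, 0, 0)],
    [c / math.sqrt(6.0) for c in (1, 1, -2, 0)],
    [c / math.sqrt(12.0) for c in (1, 1, 1, -3)],
]

def dot(u, v):
    """Sum of products of corresponding entries."""
    return sum(a * b for a, b in zip(u, v))

# Entry (i, j) is the sum of products of the coefficients of Y_i and Y_j:
# 1 on the diagonal and 0 elsewhere for an orthogonal transformation.
gram = [[dot(ri, rj) for rj in A] for ri in A]

x = [0.3, -1.2, 2.5, 0.7]           # arbitrary illustrative values of X_1..X_4
y = [dot(row, x) for row in A]
sum_x2 = dot(x, x)                  # sum of X_j^2
sum_y2 = dot(y, y)                  # sum of Y_i^2, equal to sum_x2 by Eq. (4.7.1)
n_xbar2 = len(x) * (sum(x) / len(x)) ** 2   # n * Xbar^2, equal to Y_1^2
```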

Theorem 4.2  The sum of two independent chi-square variates with $n_1$ and $n_2$
degrees of freedom is a chi-square variate with $n_1 + n_2$ degrees of freedom.
This follows from the corresponding property for gamma variates.

Theorem 4.3 (Fisher's Theorem)  If $A = \sum_{j=1}^{n} X_j^2$ and $B = \sum_{i=1}^{h} Y_i^2$,
where the $Y_i$ are orthogonal linear functions of the independent standard normal
variates $X_j$, then $A - B$ is a chi-square variate with $n - h$ degrees of freedom, and
is independent of $B$.

Note that by Theorem 4.2 and the distribution of $Y_i^2$, the quantity $B$ is a
chi-square variate with $h$ degrees of freedom. Since $A = \sum_{i=1}^{n} Y_i^2$, the difference
$A - B$ is a sum of $n - h$ of the $Y_i^2$ and is therefore a chi-square variate with
$n - h$ degrees of freedom. Fisher's theorem states that $A - B$ and $B$ are
distributed independently [5].

Theorem 4.4 (Cochran's Theorem)  If $A = \sum_{j=1}^{n} X_j^2$ and if $A = q_1 + q_2$
$+ \ldots + q_k$, where the $q$'s are quadratic forms in the $X_j$ with $n_1, n_2, \ldots, n_k$ degrees
of freedom respectively, then a necessary and sufficient condition that the $q$'s are
independent chi-square variates with $n_1, n_2, \ldots, n_k$ degrees of freedom is

$n_1 + n_2 + \ldots + n_k = n$

A quadratic form is an expression of the type $q = \sum_{i,j} a_{ij}X_iX_j$, where the $a_{ij}$
are real numbers. To say that the form has $r$ degrees of freedom (or is of rank $r$)
means that the largest non-zero determinant which can be formed from the
matrix $a_{ij}$ has $r$ rows and columns. See Appendix A.20 and reference [6].

Example 4  If $\bar{X} = (1/n)\sum X_i$, then as shown above, $n\bar{X}^2$ is a chi-square
variate with one d.f. We may write

(4.7.4)  $A = \sum X_i^2 = \sum (X_i - \bar{X})^2 + n\bar{X}^2$

It follows that $\sum (X_i - \bar{X})^2$, which is $n - 1$ times the sample variance, is
distributed as $\chi^2$ with $n - 1$ degrees of freedom, independently of $n\bar{X}^2$.
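
The decomposition (4.7.4) and its consequence may be illustrated by a small sampling experiment. This is only a sketch under our own assumptions (samples of size $n = 5$ from a standard normal population, 20,000 repetitions, a fixed seed so that the run is repeatable); it checks the algebraic identity and the mean $n - 1$ of the resulting chi-square variate, not its full distribution.

```python
import random

rng = random.Random(2024)   # fixed seed so the run is repeatable
n, reps = 5, 20_000         # illustrative sample size and number of samples

max_identity_gap = 0.0
within_total = 0.0
for _ in range(reps):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(x) / n
    a = sum(v * v for v in x)                    # A = sum of X_i^2
    within = sum((v - xbar) ** 2 for v in x)     # sum of (X_i - Xbar)^2
    # Largest departure from A = within + n * Xbar^2 over all samples.
    max_identity_gap = max(max_identity_gap, abs(a - (within + n * xbar ** 2)))
    within_total += within

mean_within = within_total / reps   # a chi-square with n - 1 d.f. has mean n - 1
```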

*  4.8  The  Log-Normal  Distribution  It  sometimes  happens  that  if  a  variate  X 
(which  takes  only  positive  values)  is  markedly  skew  in  its  distribution,  log  X  is 
much  more  nearly  normal.  This  may  be  tested  readily  by  plotting  the  cumulative 
percentage  frequency  for  a  good-sized  sample  against  the  corresponding  X  on 
special  logarithmic  probability  graph  paper.  This  paper  has  a  logarithmic  scale 
along  one  axis  and  a  probability  scale  (like  that  in  Figure  20)  along  the  other.  If 
the resulting points lie nearly on a straight line, the distribution of $X$ in the
population may be taken as log-normal.

Some  examples  of  distributions  which  have  been  found  to  be  nearly  log- 
normal  are  the  sizes  of  silver  particles  in  a  photographic  emulsion,  the  survival 
times  of  bacteria  in  given  strengths  of  disinfectant,  the  effective  lengths  of  life  of 
some  types  of  industrial  equipment,  the  blood  pressures  of  human  beings,  the 
magnitudes  of  maximum  annual  floods  for  a  given  river,  and  even  the  numbers  of 
words  in  a  sentence  written  by  George  Bernard  Shaw. 

Let $Y = \log_e X$, and suppose that the distribution of $Y$ is normal with mean
$\alpha$ and variance $\beta$. Let $f(x)$ and $g(y)$ be the density functions for $X$ and $Y$
respectively. Then

(4.8.1)  $f(x)\,dx = g(y)\,dy = g(y)\,\dfrac{1}{x}\,dx$

so that

(4.8.2)  $f(x) = x^{-1}g(y) = x^{-1}(2\pi\beta)^{-1/2}e^{-(\log_e x - \alpha)^2/(2\beta)}$

The $r$th moment about 0 of the variate $X$ is

(4.8.3)  $\mu'_r = \int_0^{\infty} x^r f(x)\,dx = \int_{-\infty}^{\infty} e^{ry}g(y)\,dy$

since $x = e^y$ if $y = \log_e x$. Carrying out the integration, we obtain

(4.8.4)  $\mu'_r = \exp(r\alpha + \tfrac{1}{2}r^2\beta)$

The mean of $X$ is therefore

(4.8.5)  $\mu'_1 = \mu = \exp(\alpha + \tfrac{1}{2}\beta)$

and the variance is

(4.8.6)  $\mu'_2 - (\mu'_1)^2 = \sigma^2 = \exp(2\alpha + 2\beta) - \exp(2\alpha + \beta) = \mu^2\eta^2$

where $\eta^2 = e^{\beta} - 1$. The quantity $\eta$ is the ratio $\sigma/\mu$, which is called the coefficient
of variation. The skewness of the distribution is given by

(4.8.7)  $\gamma_1 = \eta^3 + 3\eta$

If $Y = \log_{10} X = c\log_e X$, where $c = 0.4343$ approximately, and if $\alpha$ and
$\beta$ now refer to $\log_{10} X$, the Eqs. (5), (6) and (7) will need to be modified by
writing $\alpha/c$ for $\alpha$ and $\beta/c^2$ for $\beta$.
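
The closed forms (4.8.4) to (4.8.7) can be checked against a direct numerical evaluation of the integral in (4.8.3). The following is a sketch under stated assumptions: natural logarithms, the illustrative values $\alpha = 1$ and $\beta = 0.25$, and a trapezoidal rule over twelve standard deviations either side of $\alpha$. The function names are ours.

```python
import math

def lognormal_summary(alpha, beta):
    """Mean, variance, coefficient of variation and skewness from Eqs. (4.8.5)-(4.8.7)."""
    mean = math.exp(alpha + 0.5 * beta)
    eta = math.sqrt(math.exp(beta) - 1.0)
    return mean, (mean * eta) ** 2, eta, eta ** 3 + 3.0 * eta

def raw_moment_numeric(r, alpha, beta, half_width=12.0, steps=40_000):
    """Trapezoidal estimate of the integral of e^{ry} g(y) dy in Eq. (4.8.3)."""
    sd = math.sqrt(beta)
    lo = alpha - half_width * sd
    h = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        g = math.exp(-(y - alpha) ** 2 / (2.0 * beta)) / math.sqrt(2.0 * math.pi * beta)
        weight = 0.5 if i in (0, steps) else 1.0   # trapezoidal end-point weights
        total += weight * math.exp(r * y) * g
    return total * h

alpha, beta = 1.0, 0.25             # illustrative parameter values
mean, variance, eta, skewness = lognormal_summary(alpha, beta)
m1 = raw_moment_numeric(1, alpha, beta)
m2 = raw_moment_numeric(2, alpha, beta)
```

Here `m1` should reproduce the mean of (4.8.5), and `m2 - m1 ** 2` the variance of (4.8.6).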

Various  modifications  of  the  simple  log-normal  distribution  have  been 
suggested.  A  full  discussion  may  be  found  in  [7],  and  a  table  of  critical  values  of 
the  distribution  in  [8].  The  logarithmic  transformation  is  often  used  to  stabilize 
variance  in  situations  where  the  observational  data  fall  into  groups  with  different 
means  and  where  in  each  group  the  standard  deviation  is  roughly  proportional 
to  the  mean.  The  transformed  variates  will  in  this  case  have  approximately 
constant  variance. 

4.9  Families of Theoretical Distributions  The process of curve-fitting was
at one time very popular among statisticians (much more so than it is today),
and whole families of theoretical distributions were invented to fit (as it was
hoped) almost any kind of empirical distribution that might turn up. One such
family (including eight principal types of curve and a variety of special cases)
was devised by Karl Pearson. Another idea, due to the Scandinavian statisticians
Gram and Charlier, was to use the normal distribution, modified by adding terms
proportional to the 1st, 2nd, 3rd, ... derivatives of the normal density function.
The coefficients of these terms turn out to be either zero or else simply expressible
in terms of the cumulants of the distribution. A brief discussion of the Pearson
family of curves and of the Gram-Charlier series may be found in [9].

4.10  The Central Limit Theorem  We close this chapter with a short
description of a famous theorem which plays a central role in the theory of statistical
inference, and accounts very largely for the importance of the normal
distribution in theoretical investigations.

Let $X_1, X_2, \ldots, X_n$ be independent random variates all having the same
distribution with mean $\mu$ and variance $\sigma^2$, but not necessarily normal. Let the
standardized variate corresponding to $X_j$ be

(4.10.1)  $Z_j = \dfrac{X_j - \mu}{\sigma}$

and let $Y_n$ be defined by

(4.10.2)  $Y_n = \dfrac{\sum_j Z_j}{n^{1/2}} = n^{1/2}\bar{Z}$

where $\bar{Z}$ is the arithmetic mean of the $Z_j$. Then the theorem (in its simplest and
least general form) states that as $n \to \infty$, $Y_n$ tends to a standard normal variate,
or, in symbols,

(4.10.3)  $P(Y_n < y) \to \Phi(y)$

The point of the theorem is that, no matter what the original distribution of
$Z_j$ may be (provided of course that $X_j$ possesses a mean and variance), the mean
of a large enough sample will have a nearly normal distribution.

The cumulant generating function for $Z_j$ will be

(4.10.4)  $K_j(h) = \dfrac{h^2}{2} + O(h^3)$*

since the coefficients of $h$ and of $h^2/2!$ are 0 and 1 respectively. Since $Y_n$ is a
linear function of the $Z_j$, with coefficients all equal to $n^{-1/2}$, the c.g.f. for $Y_n$ will
be

(4.10.5)  $K(h) = \sum_j K_j(hn^{-1/2}) = \dfrac{h^2}{2} + \text{terms of order } n^{-1/2}$

As $n \to \infty$, $K(h) \to h^2/2$, which is the c.g.f. for a standard normal variate. This
suggests the result, which is indeed true, that

(4.10.6)  $\lim_{n\to\infty} P\left(n^{1/2}\,\dfrac{\bar{X} - \mu}{\sigma} < y\right) = \Phi(y)$
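
The limit (4.10.6) may be illustrated by a sampling experiment. The sketch below is ours and rests on arbitrary choices: uniform variates on (0, 1), $n = 30$, 20,000 repetitions, the point $y = 1$, and a fixed seed. It estimates $P(Y_n < 1)$ and sets it beside $\Phi(1)$.

```python
import math
import random
from statistics import NormalDist

def standardized_sum(n, rng):
    """Y_n of Eq. (4.10.2) for n uniform(0, 1) variates (mean 1/2, variance 1/12)."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    return sum((rng.random() - mu) / sigma for _ in range(n)) / math.sqrt(n)

def empirical_cdf_at(y, n, reps, rng):
    """Fraction of simulated Y_n values falling below y."""
    return sum(standardized_sum(n, rng) < y for _ in range(reps)) / reps

rng = random.Random(7)              # fixed seed so the run is repeatable
estimate = empirical_cdf_at(1.0, n=30, reps=20_000, rng=rng)
target = NormalDist().cdf(1.0)      # Phi(1), about 0.8413
```

With these settings the estimate should agree with $\Phi(1)$ to within ordinary sampling error, although the parent distribution is far from normal.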

It is not necessary that the $X_j$ should all have the same distribution. If
$E(X_j) = \mu_j$ and $V(X_j) = \sigma_j^2$, and if $M_n = \sum_j \mu_j$ and $S_n^2 = \sum_j \sigma_j^2$, then (as
proved by Lindeberg)

(4.10.7)  $\lim_{n\to\infty} P\left(\dfrac{\sum_j X_j - M_n}{S_n} < y\right) = \Phi(y)$

provided that the following condition holds for every $\varepsilon > 0$:

(4.10.8)  $\lim_{n\to\infty} \dfrac{1}{S_n^2}\sum_j \int (x - \mu_j)^2 f_j(x)\,dx = 0$

where $f_j(x)$ is the density function for $X_j$ and where the integral is taken over all
values of $x$ such that $|x - \mu_j| > S_n\varepsilon$. This condition implies that $S_n \to \infty$ but
$\sigma_j/S_n \to 0$, as $n \to \infty$, for every value of $j$. In other words, the total sum of
variances tends to infinity but the proportional contribution of each individual
variate to this sum tends to zero. If the variate $X_j$ is discrete instead of continuous
the integral in Eq. (8) must be replaced by a sum.

*The notation $O(h^3)$ means terms of the order of $h^3$. This includes all terms proportional
to $h^3$ or to any higher power of $h$.

Lyapunov proved that if the absolute third moment exists for each of the
$X_j$, so that for any $n$ a finite $R_n$ exists, given by

(4.10.9)  $R_n^3 = \sum_j \int |x - \mu_j|^3 f_j(x)\,dx$

then the condition for the central limit theorem, Eq. (7), to hold is the simple one

(4.10.10)  $\lim_{n\to\infty} \dfrac{R_n}{S_n} = 0$

It is not even necessary that the $X_j$ should be independent. It is sufficient that
$X_i$ and $X_j$ should be independent for $|i - j| > m$, where $m$ is some fixed number.
This means that if the variates are arranged in some natural order, consecutive or
nearly consecutive members may be dependent, provided that all widely separated
ones are independent.

Sums of random variables whose distributions do not have a finite second
moment may not show any tendency to approach normality. If the $X_j$ have a
Cauchy distribution, given by

(4.10.11)  $f(x) = [\pi(1 + x^2)]^{-1}$

then the distribution of $\bar{X}$ ($= n^{-1}\sum_j X_j$) is the same Cauchy distribution, no
matter how large $n$ may be.
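
The behaviour of the Cauchy mean can be seen in a simple experiment. The sketch below is an illustration under our own assumptions (4,000 repetitions, sample sizes 1 and 100, a fixed seed). Since the density (4.10.11) has quartiles at $-1$ and $+1$, the interquartile range of $\bar{X}$ should stay near 2 however many variates are averaged.

```python
import math
import random

def cauchy(rng):
    """One draw from the density (4.10.11), by inverting its distribution function."""
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr_of_sample_mean(n, reps, rng):
    """Interquartile range of the mean of n Cauchy variates, estimated from reps samples."""
    means = sorted(sum(cauchy(rng) for _ in range(n)) / n for _ in range(reps))
    return means[(3 * reps) // 4] - means[reps // 4]

rng = random.Random(11)             # fixed seed so the run is repeatable
iqr_single = iqr_of_sample_mean(1, 4_000, rng)
iqr_mean_100 = iqr_of_sample_mean(100, 4_000, rng)
```

For a parent distribution with finite variance the second figure would be about one tenth of the first; here the two should be nearly equal.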

For  a  fuller  discussion  of  the  Central  Limit  Theorem  see  [10]  and  [11]. 


PROBLEMS 

A.  (§§4.1-4.3) 

1.  Show that the m.g.f. for the uniform distribution on the interval (0, 1) has the
form $M(h) = 1 + h/2! + h^2/3! + \ldots$. Write down the expression for the c.g.f.,
expand it as far as the term in $h^4$, and so obtain the first four cumulants.

2.  What transformation will change the variate $X$ to one having a uniform
distribution on (0, 1), if the density function for $X$ is $f(x) = (x - 1)/2$, $1 < x < 3$, and
$f(x) = 0$ for $x < 1$ and $x > 3$?

3.  If $f(x) = 2xe^{-x^2}$, $x > 0$, find the density function for $U$, where $U = X^2$.

4.  If $X$ has the density function $f(x)$, $x > 0$, what is the density function for
$U = aX^2 + b$, where $a > 0$? Hint: $u > b$; $0 < X < \left(\dfrac{u - b}{a}\right)^{1/2}$, when $U < u$. To
get $g(u)$, use Appendix A.9.

5.  If $X$ has the density function $f(x) = 2x$, $0 < x < 1$, find the distribution of
$U = (3X - 1)^2$. Hint: For $U < u$, $0 < u < 1$, $X$ goes from $\dfrac{1 - u^{1/2}}{3}$ to $\dfrac{1 + u^{1/2}}{3}$;
for $1 < u < 4$, $X$ goes from 0 to $\dfrac{1 + u^{1/2}}{3}$.

6.  If $X$ has the continuous distribution function $F(x)$, what are the distribution
functions of (a) $e^X$ (b) $\sin X$ (c) $F(X)$? Hint: $P(e^X < x) = P(X < \log x) = F(\log x)$.

B.  (§§ 4.4-4.5)

1.  Show that

$\dbinom{n}{r} = \dfrac{\Gamma(n + 1)}{\Gamma(r + 1)\Gamma(n - r + 1)} = \dfrac{1}{rB(n - r + 1,\ r)}$

2.  Show that the gamma distribution, Eq. (4.4.6), has a single mode at $x = \beta(\alpha - 1)$
if $\alpha > 1$ and touches the $x$-axis at the origin if $\alpha > 2$.

3.  Evaluate  as  gamma  functions  the  following  integrals : 

(a)  f  °V**2/3  dx     (b)  f  %-3*(l  +  x/2)7  dx.  Hint:  In  (b)  put  1  +  x/2  =  y/6. 

4.  Find  the  constant  K  if  K  \       z2(l  +  z2)-*/2  dz  =  1.  #wf;  Put  z  =  tan  0  and 

J  -oo 

use  Eq.  (A.6.3). 

5.  Show  that 


cos?  0  d6  = 
o 
6.  Use  Eq.  (A.6.4)  to  show  that 


'7T/2 

sinPddd  =  \B\ 
o 


(^4 


Hint: In Eq. (A.6.4) divide the domain of integration into two parts, 0 to 1 and 1 to $\infty$.
In the second part put $y = 1/x$.

7.  Prove that $B_x(\beta, \alpha) = B(\alpha, \beta) - B_{1-x}(\alpha, \beta)$ and that therefore $I_x(\beta, \alpha) =$
$1 - I_{1-x}(\alpha, \beta)$. Hint: In Eq. (4.5.6), put $w = 1 - v$.

8.  Show that the expectation of the positive square root of a beta variate with
parameters $\alpha$ and $\beta$ is $\Gamma(\alpha + \tfrac{1}{2})\Gamma(\alpha + \beta)/[\Gamma(\alpha)\Gamma(\alpha + \beta + \tfrac{1}{2})]$ and that the expectation
of the positive square root of a one-parameter gamma variate is $\Gamma(\alpha + \tfrac{1}{2})/\Gamma(\alpha)$.

9.  The harmonic mean of a variate $X$ may be defined as the reciprocal of the
expectation of $1/X$. If the probability density for $X$ is $f(x) = x^n e^{-x}/\Gamma(n + 1)$, $0 < x < \infty$,
$n > 0$, show that the harmonic mean of $X$ is equal to $n$.

Find the harmonic mean of the distribution with density

$f(x) = x^{m-1}(1 + x)^{-m-n}/B(m, n)$,  $0 < x < \infty$,  $m > 1$,  $n > 2$.

10.  If $X$ is a Poisson variate with mean $M$, and $M$ itself a gamma variate with
parameters $\alpha$ and $\beta$, show that the probability that $X = x$ is given by

$P(X = x) = \Gamma(x + \alpha)\beta^x/[\Gamma(\alpha)\,x!\,(1 + \beta)^{\alpha + x}]$
$= (-1)^x\dbinom{-\alpha}{x}\beta^x(1 + \beta)^{-\alpha - x}$.

Since this is, for each $x$ (0, 1, 2, ...), a term in the binomial expansion of $[(1 + \beta) - \beta]^{-\alpha}$,
the distribution is called "negative binomial." See § 1.11. Hint: Integrate the joint
distribution of $X$ and $M$ over all values of $\mu$.

11.  Prove that the cumulative Poisson probability $P(c, \mu)$ may be expressed in
terms of the incomplete gamma function by the relation $P(c, \mu) = \Gamma_\mu(c)/\Gamma(c) =$
$I(u, c - 1)$, where $u = \mu c^{-1/2}$. Hint: Taylor's theorem may be written

$f(a + h) = f(a) + hf'(a) + \ldots + \dfrac{h^{c-1}}{(c-1)!}f^{(c-1)}(a) + \dfrac{h^c}{(c-1)!}\int_0^1 f^{(c)}(a + th)(1 - t)^{c-1}\,dt$

Put $f(x) = e^x$, $a = 0$, $h = \mu$, and divide through by $e^{\mu}$. Note that

$\sum_{x=0}^{c-1} e^{-\mu}\dfrac{\mu^x}{x!} = 1 - P(c, \mu)$.

C.  (§§  4.6-4.9) 

1.  Show that the c.g.f. of a standardized one-parameter gamma variate tends to the
value $h^2/2$ as the parameter tends to infinity. Hence show that the c.g.f. of the
standardized chi-square distribution tends to this value as $n \to \infty$. (See Problem D.13 of
Chapter 3.)

2.  Prove that if $t = (\chi^2 - n)/(2n)^{1/2}$, the density function for $t$ is given by $f(t) =$
$K(t + c)^{c^2-1}e^{-ct}$, $-c < t < \infty$, where $c^2 = n/2$ and $K = (c)^{c^2}e^{-c^2}/\Gamma(c^2)$. (This
distribution is known as Pearson's Type III. See § 4.9. The skewness is $2/c = (8/n)^{1/2}$.)

3.  Show that the probability that $\chi^2 > c$ may be written $1 - I(c/(2n)^{1/2}, (n - 2)/2)$,
where $I$ is Pearson's incomplete gamma function.

4.  If $y\,dx$ is the probability that $X$ lies between $x$ and $x + dx$ and if $y$ is given by
the solution of the differential equation $dy/dx = y(a - x)/(bx + c)$, show that (for
suitable values of the constants $a$, $b$, $c$) a certain linear function of $X$ has the $\chi^2$
distribution with $n$ degrees of freedom, where $n = 2(1 + a/b + c/b^2)$. Hint: The arbitrary
constant in the solution of the differential equation is determined by the condition
$\int y\,dx = 1$, from $-c/b$ to $\infty$. Put $V = 2(bX + c)/b^2$ and show that $V$ is a $\chi^2$-variate.

5.  If $Y = \log_e X$, and $Y$ is a standard normal variate, write down the density
function for $X$, and calculate the expectation and variance of $X$ by integration.

6.  If $\log_e X$ is normally distributed with mean 1 and variance 4 calculate the
probability that $X$ lies between 1/2 and 2.

Chapter 5

SAMPLE  AND  POPULATION 

5.1  Inferences  from  Sample  to  Population  As  we  have  already  seen,  the 
data  in  most  statistical  problems  relate  to  a  sample  drawn  from  some  parent 
population,  or  universe  (as  it  is  sometimes  called).  Various  characteristics  of 
the  sample,  such  as  the  mean,  median,  standard  deviation  or  skewness,  may  be 
calculated  from  the  data,  and  they  serve  to  give  a  concise  description  of  the 
sample  itself.  Their  more  important  use,  however,  is  to  enable  us  to  make 
statements  about  the  population.  Such  statements,  of  course,  being  of  the 
nature  of  inductive  inferences,  cannot  be  made  with  complete  certainty,  but 
only  with  more  or  less  probability  of  being  true.  Nevertheless  it  is  worth  while 
to  be  able  to  state,  for  example,  that  the  mean  of  a  particular  population  may  be 
taken as lying between 21.7 and 25.8, with a probability of 0.90 that this
statement is true. We shall see in the present chapter how some estimates of this sort
are arrived at.

The population characteristics in which we are interested are usually
parameters which occur in the distribution of some variate. If, for example, the
population is assumed to be normal, as far as a particular variate is concerned,
the density function for this variate will contain two parameters, $\mu$ and $\sigma$, which
are the population mean and standard deviation respectively. These may be
estimated from the characteristics of a sample, such as the median and the
range, for instance, or the sample mean and the sample standard deviation.

When a sample is used to make inferences about the population, we generally
assume that the sample is random. This usually means (when the population is
finite) that every individual in the population has an equal chance of being
included in the sample. More generally, if $X$ is the random variable which is
under consideration and which has a distribution function $F(x)$ in the population,
and if $X_1, X_2, \ldots, X_N$ are measured values of $X$ on sample items from the
population, the sampling is random if all the $X_i$ ($i = 1, 2, \ldots, N$) are independent
random variables (see § 1.13), each with the same distribution function as $X$
itself. The probability that the observed sample has values equal to or less than
$x_1, x_2, \ldots, x_N$ for the respective items is then $F(x_1)F(x_2)\cdots F(x_N)$.

It  is  usually  desirable  that  sampling  should  be  as  nearly  random  as  possible, 
although  this  is  often  hard  to  achieve  in  practice.  Even  if  the  sampling  is  not 
purely  random,  it  is  still  possible  to  make  valid  inferences,  provided  that  the 
respective  probabilities  of  being  included  in  the  sample  are  known  for  all 
members  of  the  population.  In  a  scheme  described  as  stratified  sampling,  for 
instance, the whole population is divided into classes (or strata), each of which is
sampled separately. The sizes of the various strata must, however, be known, and
within each stratum the sampling must be random. For valid statistical inferences
there  must  always  be  present  somewhere  this  quality  of  randomness.  A  fuller 
discussion  of  some  common  sampling  procedures  will  be  found  in  Chapter  7. 

5.2  Point  Estimation  and  Interval  Estimation  Sampling  theory  deals  with 
questions  like  the  following:  given  a  random  sample  of  N  variates  from  a 
certain  population,  what  can  we  say  about  the  parameters  that  define  the 
distribution  of  such  variates  within  the  population?  There  are  two  distinct 
questions  that  we  may  ask  about  any  one  parameter,  namely,  what  is  the  best 
value  to  use  for  it  and  how  reliable  is  this  best  value?  The  first  question  is  one 
of  point  estimation — we  want  a  single  value  which  in  some  sense  is  the  "best" 
estimate  we  can  make  of  the  parameter  (various  criteria  are  possible  for  judging 
the  goodness  of  an  estimate  and  they  do  not  always  agree  in  their  choice  of  the 
best). The second question has to do with the interval in which we can
confidently expect the true but unknown value of the parameter to lie, and is said to
be  a  problem  of  interval  estimation.  We  may,  for  instance,  be  able  to  say  on  the 
basis  of  a  sample  that  the  best  estimate  we  can  make  of  the  population  mean  is 
159  lb  and  that  we  feel  90%  confident  that  the  true  value  is  somewhere  between 
150  lb  and  168  lb.  This  interval  (150  lb  to  168  lb)  is  called  a  confidence  interval, 
with  confidence  coefficient  90%  or  0.90.  The  confidence  interval  is  a  random 
variate,  calculated  from  the  sample  and  having  a  probability  distribution, 
whereas  of  course  the  population  mean,  although  unknown,  is  not  a  random 
variate  at  all  in  the  usual  sense.  We  should  not  therefore  speak  of  the  probability 
that  the  population  mean  lies  in  a  given  interval  but  rather  of  the  probability  that 
the  given  interval  includes  the  population  mean. 

To  say  that  a  confidence  interval  for  a  parameter  has  a  confidence  coefficient  of 
0.90 means that the statement "this interval includes the true value" has a
probability equal to 0.90 of being correct. In other words, if we continue to make
similar  statements  on  the  basis  of  many  other  samples  from  the  same  population, 
using  the  same  estimation  procedure,  about  90%  of  these  statements  will  be  true. 
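
This frequency interpretation can be tried out by simulation. The sketch below is our own illustration and assumes a special case not discussed at this point in the text: a normal population whose standard deviation is known, so that the interval is the sample mean plus or minus a normal deviate times $\sigma/N^{1/2}$. The population values, sample size and seed are arbitrary.

```python
import math
import random
from statistics import NormalDist

def interval_for_mean(sample, sigma, coefficient):
    """Two-sided interval for a normal mean when sigma is known (an assumed, simple case)."""
    z = NormalDist().inv_cdf(0.5 + coefficient / 2.0)
    centre = sum(sample) / len(sample)
    half_width = z * sigma / math.sqrt(len(sample))
    return centre - half_width, centre + half_width

rng = random.Random(3)                                   # fixed seed for repeatability
true_mean, sigma, size, reps = 159.0, 20.0, 16, 10_000   # illustrative values only
hits = 0
for _ in range(reps):
    sample = [rng.gauss(true_mean, sigma) for _ in range(size)]
    low, high = interval_for_mean(sample, sigma, 0.90)
    hits += low <= true_mean <= high     # does this interval include the true mean?
coverage = hits / reps
```

The intervals vary from sample to sample while the population mean stays fixed; about 90% of them should include it.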

The  concept  of  confidence  intervals  is  one  of  the  main  contributions  to 
statistical  theory  by  J.  Neyman  and  E.  S.  Pearson  [1].  A  somewhat  different 
concept,  leading  in  many  cases  (although  not  in  all)  to  identical  results,  is  that  of 
fiducial intervals, due to Sir Ronald A. Fisher [2]. In this view it is permissible to
attach a fiducial probability $f(\theta)$ to the parameter $\theta$, although this is not to be
interpreted in the ordinary (frequency) sense of probability. The idea is that
$\int_{\theta_1}^{\theta_2} f(\theta)\,d\theta$ is a measure of our belief that $\theta$ lies between $\theta_1$ and $\theta_2$ (the Latin
word "fiducia" means trust). The fiducial probability, like the confidence
interval, is calculated from the known sampling distribution of the statistic used to
estimate $\theta$ (see § 5.4).

5.3  Confidence Belts  For simplicity we consider a population defined by a
single parameter $\theta$ and we suppose that a statistic $T$, derived from a sample of
size $N$, is used to estimate $\theta$. The statistic $T$ is often called an estimator of $\theta$.
The distribution of $T$, for given $\theta$, is supposedly known. That is, for any
admissible value of $\theta$ (say in the range from $\alpha$ to $\beta$) we can calculate the
probability that $T$ will lie between two given values, $t_1$ and $t_2$.

In the diagram, Figure 26, $t$ is plotted as abscissa and $\theta$ as ordinate.
The possible values of $T$ for a sample drawn from a population with a given
value of $\theta$, say $\theta'$, lie along the line AB. On this line we can mark two points, at
$t_1$ and $t_2$, such that the probability that $T < t_1$ is a fixed value $\varepsilon$, say 0.05, and
the probability that $T > t_2$ is also $\varepsilon$.

Fig. 26  Confidence belt

If $F(t|\theta)$ is the distribution function for $T$, with given $\theta$, these probability
statements may be written

(5.3.1)  $F(t_1|\theta') = \varepsilon$,  $F(t_2|\theta') = 1 - \varepsilon$

where it is assumed that $0 < \varepsilon < \tfrac{1}{2}$.

If we now imagine that there are a great many hypothetical populations
with values of $\theta$ between $\alpha$ and $\beta$, and that for each one the appropriate values
of $t_1$ and $t_2$ are calculated, the points so obtained will lie on curves something
like those marked $C_\varepsilon$ and $C_{1-\varepsilon}$ in the diagram. Since $t$ is supposed to be an
estimate of $\theta$, it is reasonable to assume that both curves represent one-valued
monotone-increasing functions. (If $t$ is any sort of a reasonable estimate, it
should increase as $\theta$ increases.)

The region bounded by the two curves and by the lines $\theta = \alpha$ and $\theta = \beta$ is
called a confidence belt, with confidence coefficient $1 - 2\varepsilon$. This belt can
theoretically be constructed from a knowledge of the function $F(t|\theta)$ alone.

Now suppose that for one particular random sample (of size $N$) we obtain a
value $t_0$ of $T$. The value of $\theta$ for the population is unknown, except for the fact
that it must lie between $\alpha$ and $\beta$. If at $t_0$ we draw an ordinate cutting the curves
$C_\varepsilon$ and $C_{1-\varepsilon}$ at $\theta = \theta_2$ and $\theta = \theta_1$ respectively, then all points on this ordinate
between $\theta_1$ and $\theta_2$ lie inside the confidence belt. We see that $\theta_1$ is the lower
bound of values of $\theta$ such that $F(t_0|\theta) < 1 - \varepsilon$, and $\theta_2$ is the upper bound of
values of $\theta$ such that $F(t_0|\theta) > \varepsilon$. We can therefore assert, on the basis of our
sample value $t_0$, that $\theta$ lies between $\theta_1$ and $\theta_2$, and the probability that this
claim is true is $1 - 2\varepsilon$. The values $\theta_1$ and $\theta_2$ are the lower and upper confidence
limits for $\theta$, corresponding to the observed $t_0$, and $1 - 2\varepsilon$ is the confidence
coefficient. The smaller the value of $\varepsilon$ the more confidence we shall feel in the
rightness of our claim, but of course the smaller we make $\varepsilon$ the wider will our
belt become, and therefore the greater will be the interval $\theta_2 - \theta_1$. We can
increase our confidence in a statement only by making the statement vaguer.

In the above illustration it was assumed, for convenience, that the variable $T$
concerned was continuous. If the variable is discrete, the curves $C_\varepsilon$ and $C_{1-\varepsilon}$
will be stepped, as in Figure 27, which relates to the Poisson distribution of $X$.

Fig. 27  Confidence intervals for Poisson distribution

For any given $\mu$, there will in general be a value of $x_2$ such that $P(x_2, \mu) < \varepsilon$
and $P(x_2 - 1, \mu) > \varepsilon$, where $P(x_2, \mu)$ is the cumulative Poisson function

$P(x_2, \mu) = \sum_{x = x_2}^{\infty} e^{-\mu}\dfrac{\mu^x}{x!}$

It will happen for some values of $\mu$ that there is an $x_2$ such that $P(x_2, \mu) = \varepsilon$
exactly. As $\mu$ increases through such a value, $x_2$ jumps by a unit. The horizontal
portions of the stepped curve represent these values of $\mu$. Similar considerations
apply to the curve of $x_1$, which is such that $P(x_1, \mu) > 1 - \varepsilon$ and $P(x_1 + 1, \mu)$
$< 1 - \varepsilon$.

The diagram (for $\varepsilon = 0.025$) shows that if a single sample value of 21 is
observed for a Poisson variate $X$, the population mean $\mu$ may be taken, with 95%
confidence, to lie between 13 and 32.
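
Limits of this kind can also be computed rather than read from a diagram. The sketch below is an illustration under our own assumptions: it solves, by bisection, for the two values of $\mu$ at which an observed count of 21 sits exactly on a tail of probability $\varepsilon = 0.025$. This is the usual exact construction and may differ slightly from values read off the stepped curves; the function names and search range are ours.

```python
import math

def poisson_upper_tail(c, mu):
    """P(c, mu): the sum over x >= c of e^{-mu} mu^x / x!, computed as 1 minus the lower sum."""
    lower = sum(math.exp(-mu + x * math.log(mu) - math.lgamma(x + 1)) for x in range(c))
    return 1.0 - lower

def solve_increasing(func, target, lo, hi, iterations=100):
    """Bisection for func(mu) = target, assuming func increases with mu on [lo, hi]."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

observed, eps = 21, 0.025
# Lower limit: the mu at which a count of `observed` or more has probability eps.
mu_lower = solve_increasing(lambda m: poisson_upper_tail(observed, m), eps, 1e-9, 200.0)
# Upper limit: the mu at which a count of `observed` or fewer has probability eps,
# that is, a count of observed + 1 or more has probability 1 - eps.
mu_upper = solve_increasing(lambda m: poisson_upper_tail(observed + 1, m), 1.0 - eps, 1e-9, 200.0)
```

The two limits should fall close to the values 13 and 32 quoted from the diagram.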

* 5.4  Fiducial Inference  If $F(t|\theta)$ is the distribution function of $T$, and if $t_k$ is
a value such that

(5.4.1)  $P(T < t_k) = F(t_k|\theta) = k$

then in a fraction $1 - k$ of all samples drawn from a population with parameter $\theta$
the statistic $T$ will exceed the critical value $t_k$. This value $t_k$ is a function of $\theta$,
say $K(\theta)$, and $\theta$ is the inverse function of $t_k$, say $K^{-1}(t_k)$. Equation (1) may
therefore be written in either form:

(5.4.2)  $P\{T < K(\theta)\} = k$

or

(5.4.3)  $P\{\theta > K^{-1}(t_k)\} = k$

provided $K(\theta)$ is a strictly monotone-increasing function of $\theta$.

The form of Eq. (3) is the one preferred by Fisher and expresses what he calls
a fiducial probability for $\theta$. This does not depend on any assumption about the
distribution of $\theta$ prior to the examination of any samples.

If we suppose that the statistic T has a continuous distribution, then, as we
have seen in § 4.2, the transformed variate

(5.4.4)  $Y = F(T \mid \theta)$

has a uniform (rectangular) distribution on the interval 0 to 1. This means
that for any fixed number k between 0 and 1,

(5.4.5)  $P\{Y < k\} = k$


Fig. 28    Fiducial inference

Let us assume that the possible values of T form an interval (a, b) and that
the possible values of $\theta$ form an interval $(\alpha, \beta)$.

For any given $\theta$ we can plot $F(t \mid \theta)$ against t, as in Figure 28; $F(t \mid \theta)$ is the
probability that $T < t$. In most cases we shall find that for a fixed t, say $t_k$, the
values of F decrease as $\theta$ increases. The nearer $\theta$ is to $\alpha$, the nearer $F(t_k \mid \theta)$ will
be to 1, and the nearer $\theta$ is to $\beta$, the nearer $F(t_k \mid \theta)$ will be to 0. If so, the equation
$F(t_k \mid \theta) = k$ determines uniquely a value $\theta_k$ such that $\theta > \theta_k$ when $F(t_k \mid \theta) < k$.

Equation (5) may therefore be written

(5.4.6)  $P(\theta > \theta_k) = k = F(t_k \mid \theta_k)$

or

(5.4.7)  $P(\theta < \theta_k) = 1 - k = 1 - F(t_k \mid \theta_k)$

The quantity $1 - F(t_k \mid \theta_k)$ is the fiducial distribution function for $\theta$. Actually $\theta_k$
in Eq. (6) is a random variable, determined by the relation $F(t_k \mid \theta_k) = k$ for a
given k and for the observed value $t_k$ of the random variable T. The probability
statement really concerns this random variable $\theta_k$. By twisting the inequality
from the form in Eq. (2) to the form in Eq. (3) we can make a probability
statement apparently about $\theta$, but this does not convert $\theta$ into a random variable
(see further in [3]).

5.5  Confidence and Significance  The determination of confidence intervals
is closely related to the estimation of significance. A problem that sometimes
arises in statistics is that of judging whether a population parameter differs
appreciably from some value which has been fixed beforehand, perhaps from
some purely theoretical considerations. Suppose the theoretical value is $\theta_0$ and
the point estimate from a sample is $\hat{\theta}$. We need to assess the significance of the
difference $\hat{\theta} - \theta_0$. If this difference is greater (numerically) than a certain
amount we shall say that the difference is significant; if less, that it is
non-significant. Obviously, it is impossible to draw a hard-and-fast line between
significance and non-significance (there will be border-line cases which are
difficult to classify), but statisticians in general accept the following convention:
if the probability of obtaining by chance a sample with a difference numerically
as great as $\hat{\theta} - \theta_0$ is less than 0.05, the observed difference is significant; if the
probability is less than 0.01, the difference is highly significant; if the probability
is greater than 0.05, the difference is non-significant. In border-line cases the
statistician will usually prefer to suspend judgment and perhaps try to get a
larger sample.

If, having obtained the sample estimate $\hat{\theta}$, we calculate the corresponding
95% confidence interval for $\theta$, stretching say from $\theta_1$ to $\theta_2$, there will be a 5%
probability that this interval will not include $\theta_0$ when $\theta_0$ is in fact the true
value. In other words, if $\theta_0$ lies outside the 95% confidence interval the
difference $\hat{\theta} - \theta_0$ will be regarded as significant.
Similarly if $\theta_0$ lies outside the 99% confidence interval the difference will be
considered highly significant.
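The equivalence between "the theoretical value lies outside the confidence interval" and "the difference is significant" can be illustrated numerically. The sketch below is an illustration under stated assumptions: the estimate is taken to be normally distributed with a known standard error, the function name is hypothetical, and the numbers in the usage line are invented for the purpose.

```python
from statistics import NormalDist

def interval_and_test(estimate, std_error, theta0, confidence=0.95):
    """Return the confidence interval, whether theta0 falls outside it,
    and whether the two-sided test at the matching level is significant."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    outside = not (lower <= theta0 <= upper)
    # Probability of a difference numerically as great as the one observed
    z = abs(estimate - theta0) / std_error
    p_value = 2 * (1 - NormalDist().cdf(z))
    significant = p_value < 1 - confidence
    return (lower, upper), outside, significant

# Hypothetical figures: estimate 10.4 with standard error 0.15, theoretical value 10.0
print(interval_and_test(10.4, 0.15, 10.0))
```

In every case the two answers agree, which is the content of the paragraph above.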

Sometimes the statistician is faced with the question of significance for the
difference of the estimates given by two separate samples. In such a case he will
choose zero as the hypothetical value for this difference, and test whether the
observed difference is significantly different from zero. A method of doing this
is to construct the confidence interval for the observed difference and see
whether this interval includes the value zero.

Some  more  general  considerations  on  the  testing  of  hypotheses  will  be 
discussed  in  Chapter  6. 

5.6  Desirable Properties of an Estimator  Let us suppose that we are using
the statistic T, derived from a random sample of N observations $x_1, x_2, \ldots, x_N$, in
order to estimate the parameter $\theta$ which occurs in the distribution function of the
population. The estimator T is said to be consistent if, as N increases indefinitely,
T tends (stochastically) to the value $\theta$. That is, for any given $\epsilon > 0$,

(5.6.1)  $P(|T - \theta| > \epsilon) \to 0$ as $N \to \infty$

This  is  an  obvious  common-sense  requirement.  We  should  expect  a  very 
large  sample  to  give  us  practically  the  population  value  of  the  quantity  we  are 
trying  to  estimate. 

A simple test for determining consistency is provided by Chebyshev's
inequality, § 2.16. If T is such that $E(T) \to \theta$ and $V(T) \to 0$ as $N \to \infty$, then it
follows from Eq. (2.16.1) that T is a consistent estimator of $\theta$.
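The step from Chebyshev's inequality to consistency may be written out as follows. This is a sketch that assumes the inequality is applied to the mean square deviation of T about $\theta$; the exact form of Eq. (2.16.1) is not reproduced in this section, so the first line is an assumption about how it is used.

```latex
\begin{align*}
P\bigl(|T-\theta| \ge \epsilon\bigr)
  &\le \frac{E\bigl[(T-\theta)^2\bigr]}{\epsilon^2} \\
  &= \frac{V(T) + \{E(T)-\theta\}^2}{\epsilon^2}
  \;\longrightarrow\; 0 \quad \text{as } N \to \infty .
\end{align*}
```

The second line uses the decomposition of the mean square deviation into the variance plus the square of the bias, both of which tend to zero under the stated conditions.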

The estimator T is said to be unbiased if (even for finite N) $E(T) = \theta$, whatever
other parameters may occur in the distribution function. If $E(T)$ merely tends
to $\theta$ as $N \to \infty$, T is asymptotically unbiased. It is generally desirable to use an
unbiased estimator where possible, but sometimes other considerations are more
important.

The reliability of the estimate furnished by an estimator is measured by the
reciprocal of its sampling variance. The smaller this variance the more reliable
the estimate will be. The efficiency of the estimator T is given by comparing the
variance of T with that of the estimator $T_0$ which, of all possible consistent
statistics which might be used to estimate $\theta$, is the one with minimum variance. That
is, the efficiency of T is $V(T_0)/V(T)$. A statistic with an efficiency of 1 (usually
expressed as 100%) is said to be most efficient.

We shall now consider some estimators which are used to estimate the
moments, cumulants and other parameters of a population. It will be
convenient to start with a finite population.

5.7  Sampling from a Finite Population  Many of the results of sampling
theory can be obtained by supposing that a random sample of size N is drawn
from a population of size M. This enables us to use the theory of combinations.
Results for an infinite population can usually be obtained by letting $M \to \infty$.

If X is the variate measured, the arithmetic mean of X for a sample of size N is

(5.7.1)  $m = N^{-1} \sum_j X_j$

where $X_j$ is the jth item in the sample.

The corresponding quantity for the population is

(5.7.2)  $\mu = M^{-1} \sum_{\alpha=1}^{M} X_\alpha$

where $X_\alpha$ is the $\alpha$th item in the population. Some of the $X_\alpha$ will of course be the
same as the $X_j$. We shall use $X_\alpha$, however, to mean any item from the population
and $X_j$ to mean one that is also in the sample.

As an estimator of $\mu$, m is clearly consistent, since when N becomes equal to
M (it cannot get any larger than M), m becomes equal to $\mu$. With a finite
population, the expectation of a statistic such as m is defined as its average over all
possible different samples of size N that could be drawn from the population.
The number of these samples is $\binom{M}{N}$. We shall now show that this average for
m is equal to $\mu$, and therefore m is an unbiased estimator of $\mu$.

Table 5.1

Sample No.    Sample Items    Mean (m)
     1           2, 5            3.5
     2           2, 5            3.5
     3           2, 7            4.5
     4           2, 10           6.0
     5           2, 21          11.5
     6           5, 5            5.0
     7           5, 7            6.0
     8           5, 7            6.0
     9           5, 10           7.5
    10           5, 10           7.5
    11           5, 21          13.0
    12           5, 21          13.0
    13           7, 10           8.5
    14           7, 21          14.0
    15           10, 21         15.5
                 Total         125.0

The number of samples in which any particular $X_\alpha$ occurs is equal to
$\binom{M-1}{N-1}$, since the remaining $N - 1$ items in the sample can be picked from
any of the other $M - 1$ items in the population. This $X_\alpha$ contributes $X_\alpha/N$ to
the value of m for each sample in which it occurs, and therefore its contribution
to the average m over all samples is $(X_\alpha/N)\binom{M-1}{N-1} / \binom{M}{N} = X_\alpha/M$. Summing
over all $\alpha$, we obtain for the average m the amount $\sum_\alpha X_\alpha/M$, which is $\mu$.
Therefore,

(5.7.3)  $E(m) = \mu$

As an illustration involving small numbers, suppose M = 6 and N = 2.
Then $\binom{6}{2} = 15$. Let the values of $X_\alpha$ in the population be 2, 5, 5, 7, 10 and 21.
The population mean is $\mu = 50/6 = 8.33$. The 15 possible samples of size 2 and
their separate values of m are given in Table 5.1. The average m over all 15
samples is 125/15 = 8.33, which is the same as $\mu$. (In this illustration two of
the population items have the same value of X, but they count as different
items in enumerating the samples.)
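The enumeration behind this illustration is easily mechanized. The short Python sketch below (an illustration, not part of the original text) lists every sample of size 2 from the six items and averages the sample means with exact fractions; the variable names are arbitrary.

```python
from fractions import Fraction
from itertools import combinations

population = [2, 5, 5, 7, 10, 21]   # the six items of the illustration
N = 2

# combinations() distinguishes items by position, so the two 5's count separately
samples = list(combinations(population, N))
means = [Fraction(sum(s), N) for s in samples]

average_m = sum(means) / len(means)
mu = Fraction(sum(population), len(population))

print(len(samples), sum(means), average_m, mu)   # 15 125 25/3 25/3
```

The average of the 15 sample means and the population mean are both exactly 25/3, in agreement with Eq. (5.7.3).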

A precisely similar proof may be carried through for the pth moment of X
about the origin, denoted by $m'_p$ for the sample and by $\mu'_p$ for the population.
A more convenient notation for $m'_p$, suggested by Tukey [4], is the angle
bracket $\langle p \rangle$. With this notation,

(5.7.4)  $\langle p \rangle = N^{-1} \sum_j X_j^p$

(5.7.5)  $E(\langle p \rangle) = \mu'_p = M^{-1} \sum_\alpha X_\alpha^p$

Let us now consider a pair of items $X_\alpha$, $X_\beta$ from the population. The
subscripts distinguish them as different items, but their actual values may happen to
be the same (like the two 5's in the illustration above). Each such pair appears in
$\binom{M-2}{N-2}$ different samples. We define the angle bracket $\langle pq \rangle$ and the
corresponding population parameter $\mu'_{pq}$ by

(5.7.6)  $\langle pq \rangle = [N(N-1)]^{-1} \sum' X_i^p X_j^q$

(5.7.7)  $\mu'_{pq} = [M(M-1)]^{-1} \sum' X_\alpha^p X_\beta^q$

where the sum in Eq. (6) is over the $N(N-1)$ pairs $X_i$, $X_j$ in a single sample of
size N. The sign $\sum'$ here indicates that the sum is to be taken over all different
values of the subscripts.

By considering the contribution of each pair of items $X_\alpha$, $X_\beta$ to the average of
$\langle pq \rangle$, we can readily obtain the result

(5.7.8)  $E(\langle pq \rangle) = \mu'_{pq}$

The angle brackets such as $\langle pq \rangle$ are therefore unbiased estimators of the
corresponding population parameters. This is true also of brackets with three,
four, or more symbols.
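The unbiasedness of the two-symbol brackets can be confirmed by direct enumeration. The following sketch is an illustration only: it reuses the six-item population of the earlier illustration, takes samples of size 3 (an arbitrary choice), and uses a hypothetical helper named bracket that applies the definition of the bracket literally, summing over ordered pairs of different items.

```python
from fractions import Fraction
from itertools import combinations, permutations

def bracket(items, p, q):
    """<pq>: average of X_i^p * X_j^q over all ordered pairs with i != j."""
    pairs = list(permutations(items, 2))
    return Fraction(sum(a**p * b**q for a, b in pairs), len(pairs))

population = [2, 5, 5, 7, 10, 21]
N = 3
samples = list(combinations(population, N))

averages = {}
for p, q in [(1, 1), (1, 2), (2, 2)]:
    # expectation = average of the sample bracket over all possible samples
    averages[(p, q)] = sum(bracket(s, p, q) for s in samples) / len(samples)
    print((p, q), averages[(p, q)], bracket(population, p, q))
```

Applied to the whole population, the same helper gives the population parameter, so each printed pair of values should coincide exactly.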

*  5.8  Fisher's k-Statistics  Unfortunately, the sample moments about the
mean (of second or higher orders) are not unbiased estimators of the
corresponding population parameters. However, we can define a set of statistics, called
k-statistics, each of which does have this relationship to the corresponding
population parameter. When the population is infinite, these parameters
become identical with the cumulants discussed in § 2.12.

In order to calculate the k-statistics for a sample systematically, it is
convenient to start with sums of powers of the $X_j$. Let

(5.8.1)  $S_1 = \sum_j X_j = N\langle 1 \rangle$
         $S_2 = \sum_j X_j^2 = N\langle 2 \rangle$
         etc.

Then, as shown in Appendix A.11,

(5.8.2)  $N(N-1)\langle 11 \rangle = S_1^2 - S_2$
         $N(N-1)\langle 12 \rangle = S_1 S_2 - S_3$
         $N(N-1)\langle 13 \rangle = S_1 S_3 - S_4$
         $N(N-1)\langle 22 \rangle = S_2^2 - S_4$
         etc.

Also,

(5.8.3)  $(N-2)\langle 111 \rangle = S_1\langle 11 \rangle - 2\langle 12 \rangle$
         $(N-2)\langle 112 \rangle = S_1\langle 12 \rangle - \langle 22 \rangle - \langle 13 \rangle$
         $(N-3)\langle 1111 \rangle = S_1\langle 111 \rangle - 3\langle 112 \rangle$
         etc.

The k-statistics may then be defined in terms of these brackets:

(5.8.4)  $k_1 = \langle 1 \rangle$
         $k_2 = \langle 2 \rangle - \langle 11 \rangle$
         $k_3 = \langle 3 \rangle - 3\langle 12 \rangle + 2\langle 111 \rangle$
         $k_4 = \langle 4 \rangle - 4\langle 13 \rangle - 3\langle 22 \rangle + 12\langle 112 \rangle - 6\langle 1111 \rangle$

Generalized k-statistics [4] may similarly be defined, such as

(5.8.5)  $k_{11} = \langle 11 \rangle$
         $k_{12} = \langle 12 \rangle - \langle 111 \rangle$
         $k_{22} = \langle 22 \rangle - 2\langle 112 \rangle + \langle 1111 \rangle$
         etc.

These serve as checks on the calculation of the k-statistics, since the following
relations hold:

(5.8.6)  $k_1^2 = k_{11} + k_2/N$
         $k_1 k_2 = k_{12} + k_3/N$
         $k_2^2 = \frac{N+1}{N-1} k_{22} + \frac{1}{N} k_4$

Each k is an unbiased estimator of the corresponding quantity for the
population, which will be denoted by $\kappa'$. The $\kappa'$'s are defined like the k's, except
that $\langle p \rangle$ is replaced by $\mu'_p$, $\langle pq \rangle$ by $\mu'_{pq}$, etc. For an infinitely large population
they become identical with the cumulants as previously defined.

The k-statistics are expressible in terms of the sample moments about the
mean, discussed in § 2.8. The relations are:

(5.8.7)  $k_2 = \frac{N}{N-1} m_2$
         $k_3 = \frac{N^2}{(N-1)(N-2)} m_3$
         $k_4 = \frac{N^2}{(N-1)(N-2)(N-3)} [(N+1) m_4 - 3(N-1) m_2^2]$

It is not, however, necessary to find the moments first. The systematic
procedure of Eqs. (1) to (4) will give the k-statistics directly.
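That the bracket definitions and the moment formulas lead to the same k-statistics can be checked on a small sample. The sketch below is an illustration under stated assumptions: the brackets are computed by brute force from their definition (averaging products over all arrangements of different items) rather than by the power-sum relations, the function names are hypothetical, and the six-value sample is chosen only for convenience.

```python
from fractions import Fraction
from itertools import permutations
from math import prod

def bracket(sample, *powers):
    """Angle bracket <p q ...>: average of X_i^p X_j^q ... over different subscripts."""
    tuples = list(permutations(sample, len(powers)))
    total = sum(prod(x**p for x, p in zip(t, powers)) for t in tuples)
    return Fraction(total, len(tuples))

def k_from_brackets(sample):
    """k1..k4 from the bracket definitions."""
    b = lambda *pw: bracket(sample, *pw)
    k1 = b(1)
    k2 = b(2) - b(1, 1)
    k3 = b(3) - 3 * b(1, 2) + 2 * b(1, 1, 1)
    k4 = b(4) - 4 * b(1, 3) - 3 * b(2, 2) + 12 * b(1, 1, 2) - 6 * b(1, 1, 1, 1)
    return k1, k2, k3, k4

def k_from_moments(sample):
    """k1..k4 from the sample moments about the mean (needs at least 4 items)."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    m = lambda r: sum((x - mean) ** r for x in sample) / n
    k2 = Fraction(n, n - 1) * m(2)
    k3 = Fraction(n * n, (n - 1) * (n - 2)) * m(3)
    k4 = Fraction(n * n, (n - 1) * (n - 2) * (n - 3)) * ((n + 1) * m(4) - 3 * (n - 1) * m(2) ** 2)
    return mean, k2, k3, k4

sample = [2, 5, 5, 7, 10, 21]   # illustrative data only
print(k_from_brackets(sample))
print(k_from_moments(sample))
```

Exact fractions are used so that the two routes can be compared without rounding error; for large samples the power-sum procedure is of course far quicker than this brute-force enumeration.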

The $\kappa'$ are similarly expressible in terms of the population moments (with
M substituted for N). Thus

(5.8.8)  $\kappa'_2 = \frac{M}{M-1} \mu_2$
         $\kappa'_3 = \frac{M^2}{(M-1)(M-2)} \mu_3$
         $\kappa'_4 = \frac{M^2}{(M-1)(M-2)(M-3)} [(M+1)\mu_4 - 3(M-1)\mu_2^2]$

When $M \to \infty$, $\kappa_2 = \mu_2$, $\kappa_3 = \mu_3$, $\kappa_4 = \mu_4 - 3\mu_2^2$, in agreement with the
definitions given in Eq. (2.12.5).

*  5.9  Computation of the k-Statistics  As an illustration of the arithmetic
involved, we will consider the data of Table 2.2 already used to calculate some of
the moments in Chapter 2. If we use an auxiliary variable $u = (x_c - 45.5)/4$,
where $x_c$ is a class-mark, then u takes only integral values from -4 to 5 and the
calculations of sums of powers are greatly shortened. The whole work can be
carried through in terms of u, and at the end we can convert back to the original
x values.

For this sample we first find

N = 1000,
$S_1 = \sum f u = 553$
$S_2 = \sum f u^2 = 2471$
$S_3 = \sum f u^3 = 4105$
$S_4 = \sum f u^4 = 18,407$

Then

(5.9.1)
$\langle 1 \rangle = 0.553$
$\langle 2 \rangle = 2.471$
$\langle 3 \rangle = 4.105$
$\langle 4 \rangle = 18.407$

$\langle 11 \rangle = [(553)^2 - 2471]/999,000 = 0.3036$
$\langle 12 \rangle = [(553)(2471) - 4105]/999,000 = 1.3637$
$\langle 13 \rangle = [(553)(4105) - 18,407]/999,000 = 2.2539$
$\langle 22 \rangle = [(2471)^2 - 18,407]/999,000 = 6.0935$

$\langle 111 \rangle = [(553)(0.3036) - 2.7274]/998 = 0.1655$
$\langle 112 \rangle = [(553)(1.3637) - 6.0935 - 2.2539]/998 = 0.7473$
$\langle 1111 \rangle = [(553)(0.1655) - 2.2419]/997 = 0.0896$

$k_1 = 0.553$
$k_2 = 2.471 - 0.3036 = 2.1674$
$k_3 = 4.105 - 4.0911 + 0.3310 = 0.345$
$k_4 = 18.407 - 9.016 - 18.280 + 8.968 - 0.538 = -0.459$

Also,

$k_{11} = 0.3036$
$k_{12} = 1.3637 - 0.1655 = 1.1982$
$k_{22} = 6.0935 - 1.4946 + 0.0896 = 4.6885$

and the checks of Eq. (5.8.6) hold, apart perhaps from a small rounding-off
error in the last decimal place.
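The arithmetic of this section can be reproduced from the four power sums alone. The Python sketch below is an illustration, not part of the original text; the function name and the dictionary keys are hypothetical, and the relations used are those of § 5.8 as read from the surrounding equations.

```python
def k_statistics_from_power_sums(n, s1, s2, s3, s4):
    """Brackets and k-statistics from N and the power sums S1..S4."""
    b1, b2, b3, b4 = s1 / n, s2 / n, s3 / n, s4 / n
    pairs = n * (n - 1)
    b11 = (s1**2 - s2) / pairs
    b12 = (s1 * s2 - s3) / pairs
    b13 = (s1 * s3 - s4) / pairs
    b22 = (s2**2 - s4) / pairs
    b111 = (s1 * b11 - 2 * b12) / (n - 2)
    b112 = (s1 * b12 - b22 - b13) / (n - 2)
    b1111 = (s1 * b111 - 3 * b112) / (n - 3)
    return {
        "k1": b1,
        "k2": b2 - b11,
        "k3": b3 - 3 * b12 + 2 * b111,
        "k4": b4 - 4 * b13 - 3 * b22 + 12 * b112 - 6 * b1111,
        "k11": b11,
        "k12": b12 - b111,
        "k22": b22 - 2 * b112 + b1111,
    }

# Power sums quoted above for the auxiliary variable u
k = k_statistics_from_power_sums(1000, 553, 2471, 4105, 18407)
for name, value in k.items():
    print(name, round(value, 4))

# The three check relations should hold up to floating-point error
n = 1000
print(k["k1"] ** 2 - (k["k11"] + k["k2"] / n))
print(k["k1"] * k["k2"] - (k["k12"] + k["k3"] / n))
print(k["k2"] ** 2 - ((n + 1) / (n - 1) * k["k22"] + k["k4"] / n))
```

Because the program carries full precision throughout, its results may differ from the hand computation in the last decimal place shown.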

Finally we can convert the k's back to the original units (pounds) by writing

(5.9.2)  $k_1 = 4(0.553) + 45.5 = 47.71$ lb
         $k_2 = 4^2(2.1674) = 34.68$ lb$^2$
         $k_3 = 4^3(0.345) = 22.1$ lb$^3$
         $k_4 = -4^4(0.459) = -118$ lb$^4$

Using these as estimators of the cumulants for the population from which the
sample was taken, we have the following estimated values of the population
parameters:

(5.9.3)  $\kappa_1 = \mu = 47.71$ lb
         $\kappa_2 = \sigma^2 = 34.68$ lb$^2$  ($\sigma = 5.889$ lb)
         $\kappa_3/\kappa_2^{3/2} = \gamma_1 = 0.108$
         $\kappa_4/\kappa_2^2 = \gamma_2 = -0.098$


*  5.10  Sheppard's Corrections  The error due to grouping the frequencies
at the mid-points of the class-intervals, in the computation of the k-statistics,
may be approximately allowed for by using some corrections first suggested by
Sheppard. These corrections are applied to the even-order k-statistics only, and
are given by the relation

(5.10.1)  $(k_r)_c = k_r - c^r B_r / r$,   $r \ge 2$

where $(k_r)_c$ is the corrected value of $k_r$, c is the class-interval, and $B_r$ is the rth
Bernoulli number (see Appendix A.12). For the first two even-order k-statistics
these corrections are

(5.10.2)  $(k_2)_c = k_2 - c^2/12$
          $(k_4)_c = k_4 + c^4/120$

In practice the corrections are most easily applied to the $k_r$ as first obtained in
the u units (for which c = 1). Thus from Eq. (5.9.1) the corrected values are

$(k_2)_c = 2.1674 - 0.0833 = 2.0841$
$(k_4)_c = -0.459 + 0.008 = -0.451$

Using these, our estimated $\kappa_2$ and $\kappa_4$ become

$\kappa_2 = 33.34$ lb$^2$,   $\kappa_4 = -115$ lb$^4$

and instead of the values given in Eq. (5.9.3) we find the following estimates:

(5.10.3)  $\sigma = 5.774$ lb
          $\gamma_1 = 0.114$
          $\gamma_2 = -0.104$

Sheppard's corrections should not be used unless the frequency curve appears
to have a single mode and tails off gradually at both ends. Moreover, unless the
sample consists of at least several hundred items, the uncertainties due to
sampling fluctuation are likely to overshadow the corrections. When the
corrections are applicable, however, their use will generally (although not invariably)
improve the estimates of the population parameters, and they are so easy to
apply that it is usually worth while to take the slight additional trouble involved.
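The corrections are simple enough to apply in a few lines. The sketch below is an illustration under stated assumptions: it uses the corrections for the second and fourth k-statistics (subtract one-twelfth of the squared class-interval, add one hundred-and-twentieth of its fourth power), starts from the rounded u-unit values 2.1674, 0.345 and -0.459 found in § 5.9, and takes the class-interval in pounds to be 4, as implied by the definition of u. The function name is hypothetical.

```python
def sheppard_corrected(k2, k4, c=1.0):
    """Sheppard-corrected second and fourth k-statistics for class-interval c."""
    return k2 - c**2 / 12, k4 + c**4 / 120

# Rounded values in u units (c = 1), as quoted in the text
k2_u, k3_u, k4_u = 2.1674, 0.345, -0.459
k2c, k4c = sheppard_corrected(k2_u, k4_u)

h = 4                                # width of a class in pounds (u = (x - 45.5)/4)
sigma = (h**2 * k2c) ** 0.5          # standard deviation in pounds
gamma1 = k3_u / k2c**1.5             # skewness is unaffected by the change of unit
gamma2 = k4c / k2c**2                # so is kurtosis

print(round(k2c, 4), round(k4c, 3))
print(round(sigma, 3), round(gamma1, 3), round(gamma2, 3))
```

Since the inputs are already rounded, the printed values may differ from those in the text by a unit in the last place.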

5.11  Variance and Covariance of the k-Statistics  As before, when dealing
with a finite population, we interpret the expectation of a statistic as its average
taken over all possible different samples of the same size. The variance of $k_1$ will
then be defined as

(5.11.1)  $V(k_1) = E(k_1^2) - \{E(k_1)\}^2$

By the results obtained in § 5.8 we know that

(5.11.2)  $E(k_1) = \kappa'_1 = \mu$

Also, from the first equation of (5.8.6),

(5.11.3)  $k_1^2 = k_{11} + k_2/N$

so that

(5.11.4)  $E(k_1^2) = E(k_{11}) + N^{-1}E(k_2) = \kappa'_{11} + N^{-1}\kappa'_2$

It follows that

(5.11.5)  $V(k_1) = \kappa'_{11} + N^{-1}\kappa'_2 - (\kappa'_1)^2$

The corresponding equation to (3) for the population is

(5.11.6)  $(\kappa'_1)^2 = \kappa'_{11} + \kappa'_2/M$

so that Eq. (5) becomes

(5.11.7)  $V(k_1) = \kappa'_2(N^{-1} - M^{-1})$

For an infinitely great population, $M^{-1} \to 0$, and we have the simple result

(5.11.8)  $V(k_1) = \kappa_2 N^{-1} = \sigma^2/N$
This measures the sampling fluctuation in the value of the arithmetic mean $k_1$.
In practice we generally do not know $\sigma^2$ except insofar as we can estimate it
from the one sample which gives us $k_1$. If we replace $\sigma^2$ by the corresponding
unbiased estimator $k_2$, we have as an estimator of the variance of $k_1$ the statistic

(5.11.9)  $\hat{V}(k_1) = k_2/N$

The square root of $\hat{V}(k_1)$ is called the standard error of $k_1$. In terms of the
sums of powers of X, defined in § 5.8,

(5.11.10)  $\hat{V}(k_1) = \frac{1}{N} \cdot \frac{S_2 - S_1^2/N}{N - 1}$
The variance of $k_2$ is similarly given by

(5.11.11)  $V(k_2) = E(k_2^2) - \{E(k_2)\}^2$

From the last equation of (5.8.6),

$E(k_2^2) = \frac{N+1}{N-1} \kappa'_{22} + \frac{1}{N} \kappa'_4$

and also,

$E(k_2) = \kappa'_2$

Since

$\kappa'_{22} = \frac{M-1}{M+1} [(\kappa'_2)^2 - \kappa'_4/M]$

we find, after a little rearrangement, that

(5.11.12)  $V(k_2) = \frac{M-N}{(M+1)(N-1)} \{2(\kappa'_2)^2 + \kappa'_4 (1 - \frac{1}{N} - \frac{1}{M} - \frac{1}{MN})\}$

which for an infinitely great population reduces to

(5.11.13)  $V(k_2) = \frac{2}{N-1} \kappa_2^2 + \frac{1}{N} \kappa_4$

If the parent population is normal, $\kappa_4 = 0$ and $\kappa_2 = \sigma^2$. In this case,

(5.11.14)  $V(k_2) = \frac{2\sigma^4}{N-1}$

In order to find the standard error of $k_2$, the unknown population parameters
must be replaced by sample estimators. It is easily verified that, for $M \to \infty$,

(5.11.15)  $E\{\frac{2}{N+1} k_2^2 + \frac{N-1}{N(N+1)} k_4\} = \frac{2}{N-1} \kappa_2^2 + \frac{1}{N} \kappa_4$

so that we can take the expression in braces on the left-hand side as an unbiased
estimator of $V(k_2)$. Therefore,

(5.11.16)  $\hat{V}(k_2) = \frac{2}{N+1} k_2^2 + \frac{N-1}{N(N+1)} k_4$

and the square root of this is the standard error of $k_2$.
The covariance of $k_1$ and $k_2$ may be defined as

(5.11.17)  $C(k_1, k_2) = E(k_1 k_2) - E(k_1) \cdot E(k_2)$

By the second equation of (5.8.6),

$E(k_1 k_2) = E(k_{12}) + E(k_3)/N$

so that

(5.11.18)  $C(k_1, k_2) = \kappa'_{12} + \kappa'_3/N - \kappa'_1 \kappa'_2$
           $= (\kappa'_1 \kappa'_2 - \kappa'_3/M) + \kappa'_3/N - \kappa'_1 \kappa'_2$
           $= \kappa'_3 (N^{-1} - M^{-1})$

For an infinitely great population,

(5.11.19)  $C(k_1, k_2) = \kappa_3 N^{-1}$

which is zero for a normal population.

The first two k-statistics are therefore uncorrelated in samples from a normal
population, although this is not true for skew populations. Since $k_3$ is an
unbiased estimator of $\kappa_3$,

(5.11.20)  $\hat{C}(k_1, k_2) = k_3 N^{-1}$

5.12  The Distribution of the Sample Mean  The arithmetic mean of a
sample of N observations, $X_1, X_2, \ldots, X_N$, is the first k-statistic $k_1$. As
mentioned previously, the expectation of $k_1$ in samples from a finite population
is the population mean $\mu$ (which is $\kappa'_1$). That is,

(5.12.1)  $E(k_1) = \kappa'_1 = \mu$

Also the variance of $k_1$, from Eqs. (5.11.7) and (5.8.8), is given by

(5.12.2)  $V(k_1) = \frac{M-N}{MN} \kappa'_2 = \frac{M-N}{N(M-1)} \sigma^2$

which for an infinite population becomes

(5.12.3)  $V(k_1) = \sigma^2/N$
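The finite-population results for the mean and for the second k-statistic can be checked by enumerating every possible sample. The sketch below is an illustration under stated assumptions, not the author's method: it reuses the six-item population of § 5.7, takes samples of size 4 (an arbitrary choice), and compares the enumerated variance of the mean, variance of the second k-statistic and covariance of the two with the closed forms of § 5.11 as read from the surrounding text. Function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def k1_k2(values):
    """Sample mean and the unbiased variance estimate k2 (divisor n - 1)."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return mean, sum((x - mean) ** 2 for x in values) / (n - 1)

def kappa_prime(pop):
    """Population mean and the population k-statistics of orders 2, 3, 4."""
    m = len(pop)
    mu = Fraction(sum(pop), m)
    mu2, mu3, mu4 = (sum((x - mu) ** r for x in pop) / m for r in (2, 3, 4))
    kp2 = Fraction(m, m - 1) * mu2
    kp3 = Fraction(m * m, (m - 1) * (m - 2)) * mu3
    kp4 = Fraction(m * m, (m - 1) * (m - 2) * (m - 3)) * ((m + 1) * mu4 - 3 * (m - 1) * mu2 ** 2)
    return mu, kp2, kp3, kp4

population = [2, 5, 5, 7, 10, 21]
M, N = len(population), 4
mu, kp2, kp3, kp4 = kappa_prime(population)

stats = [k1_k2(s) for s in combinations(population, N)]
count = len(stats)
E_k1 = sum(a for a, _ in stats) / count
E_k2 = sum(b for _, b in stats) / count
V_k1 = sum(a * a for a, _ in stats) / count - E_k1 ** 2
V_k2 = sum(b * b for _, b in stats) / count - E_k2 ** 2
C_k1k2 = sum(a * b for a, b in stats) / count - E_k1 * E_k2

scale = Fraction(1, N) - Fraction(1, M)
V_k2_formula = Fraction(M - N, (M + 1) * (N - 1)) * (
    2 * kp2 ** 2 + kp4 * (1 - Fraction(1, N) - Fraction(1, M) - Fraction(1, M * N))
)

print(E_k1 == mu, E_k2 == kp2)
print(V_k1 == kp2 * scale)        # variance of the mean
print(C_k1k2 == kp3 * scale)      # covariance of k1 and k2
print(V_k2 == V_k2_formula)       # variance of k2
```

Exact fractions are used so that agreement, where it occurs, is exact rather than approximate.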

Similar arguments to those used in § 5.11 (based on relations between angle
brackets and k-statistics) can be used to obtain the higher moments of the
distribution of $k_1$, but the calculations soon become quite complicated. It turns
out that the skewness is given by

(5.12.4)  $Sk(k_1) = \frac{M-2N}{M-2} [\frac{M-1}{N(M-N)}]^{1/2} \gamma_1$

and the kurtosis by

(5.12.5)  $Ku(k_1) = [(M-1)(M^2 - 6MN + M + 6N^2)\gamma_2 - 6M(MN + M - N^2 - 1)] \div [N(M-2)(M-3)(M-N)]$

where $\gamma_1 = \kappa_3/\kappa_2^{3/2}$ and $\gamma_2 = \kappa_4/\kappa_2^2$.

For an infinite population these reduce to

(5.12.6)  $Sk(k_1) = \gamma_1/N^{1/2}$

(5.12.7)  $Ku(k_1) = \gamma_2/N$

It is evident from Eqs. (6) and (7) that for large enough samples the skewness
and kurtosis of the distribution of $k_1$ will be nearly zero, whatever the
corresponding quantities for the population (as long as they are finite). This suggests
that the mean of a large sample from almost any kind of population will have a
distribution close to normal, and in fact, if certain conditions are satisfied, this
result follows from the Central Limit Theorem (see § 4.10).

If the parent population is normal, the mean of a sample of size N is also
normally distributed, with variance $\sigma^2/N$, whether N is large or small. (For a
proof of this, see § 8.2.) If the parent population is not normal but has a finite
variance $\sigma^2$, the variance of the sample mean is still $\sigma^2/N$ and for large N the
distribution is approximately normal. For a parent population not too wildly
skew, a sample size of 30 or more will usually give a satisfactory approximation
to normality.

As an illustration with even smaller sample size, a decidedly skew population
was constructed by writing a number from 0 to 24 on each of 1000 circular
metal-edged cardboard tags. The frequency diagram for this population is
shown in Figure 29. (There were 106 tags, for example, marked 4.) The
numbered discs were put into a goldfish bowl and well mixed. A sample of 10 discs
was drawn and the numbers were noted before the discs were replaced. This
was done repeatedly, and over a considerable period of time 2500 sample means
were obtained. These were grouped in classes 3.0 to 3.9, 4.0 to 4.9, etc. and the
first few k-statistics were calculated. The frequency polygon of the distribution
of these sample means is shown in Figure 29 along with that for the parent
population (the two polygons have different vertical scales, one shown on the
right of the diagram and one on the left). The much more symmetrical nature
of the distribution of means is obvious at a glance. Table 5.2 gives for
comparison (a) the actual characteristics (population parameters) for the parent
population of 1000 discs, (b) the theoretical characteristics for the distribution of
the mean in all possible samples of 10, (c) the estimated values for these
characteristics derived from the k-statistics of 2500 actual samples, (d) the approximate
standard errors for these estimates. For the skewness and kurtosis the standard
errors relate to a normal parent population (see Chapter 8) and are not very
reliable.

Fig. 29    Frequency polygons for a skew population and for the
means of samples of 10

Table 5.2

Characteristic    Population    Population of Means    Population of Means
                                   (Theoretical)           (Estimated)
Mean                 7.601            7.601              7.640 ± 0.028
Variance            19.57             1.939              2.006 ± 0.058
Skewness             0.896            0.279              0.381 ± 0.049
Kurtosis             0.508            0.042              0.095 ± 0.098

It  will  be  seen  that  in  all  cases  the  estimated  values  agree  with  the  theoretical 
values  within  about  once  or  twice  the  standard  error.  The  difference  for  the 
skewness  is  slightly  more  than  twice  its  standard  error. 
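The theoretical column of the table can be recovered from the population column. The sketch below is an illustration under stated assumptions: the variance and kurtosis expressions follow the finite-population formulas of this section, while the skewness expression is the standard result for sampling without replacement and is assumed here to be the form intended in the text. The function name is hypothetical.

```python
from math import sqrt

def mean_distribution(M, N, variance, gamma1, gamma2):
    """Variance, skewness and kurtosis of the mean of N items drawn
    without replacement from a finite population of M items."""
    var = (M - N) / (N * (M - 1)) * variance
    # Assumed form of the skewness formula (standard finite-population result)
    skew = (M - 2 * N) / (M - 2) * sqrt((M - 1) / (N * (M - N))) * gamma1
    kurt = ((M - 1) * (M**2 - 6 * M * N + M + 6 * N**2) * gamma2
            - 6 * M * (M * N + M - N**2 - 1)) / (N * (M - 2) * (M - 3) * (M - N))
    return var, skew, kurt

# Population of 1000 discs, samples of 10, characteristics from the table
var, skew, kurt = mean_distribution(1000, 10, 19.57, 0.896, 0.508)
print(round(var, 3), round(skew, 3), round(kurt, 3))
```

The infinite-population approximations (dividing the skewness by the square root of 10 and the kurtosis by 10) give about 0.283 and 0.051, so the finite-population terms make a visible difference in the third decimal place.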

5.13  Confidence Interval for the Mean (in Large Samples)  If we apply the
procedure of § 5.3 to the statistic m (the sample mean) as used to estimate the
parameter $\mu$ (the population mean), we obtain a confidence belt which for fairly
large samples is of almost uniform width. For a given value of $\mu$, the expected
value of m will be $\mu$ and its variance will be $\sigma^2/N$. If the sample size is large
enough for the distribution of the mean to be regarded as normal, or if the parent
population is known to be normal, the sample mean for given $\mu$ will, with
probability 0.95, lie between $\mu - 1.96\sigma N^{-1/2}$ and $\mu + 1.96\sigma N^{-1/2}$. It follows
that for a given m, the 95% confidence interval for $\mu$ lies between $m - 1.96\sigma N^{-1/2}$
and $m + 1.96\sigma N^{-1/2}$, if $\sigma$ is known (Figure 30). If $\sigma$ is not known, it may be
replaced by an estimate such as the sample standard deviation. There is, however,
a better procedure available when $\sigma$ has to be estimated from a fairly small
sample and when the parent population can be taken as normal. This procedure
will be described in § 8.5.

Fig. 30    Confidence belt for the sample mean
with known population variance

Example 1  For a sample of 345 11-year-old boys, the mean weight was
found to be 74.71 lb and the standard deviation 10.65 lb. Calculate 98%
confidence limits for the mean weight in the population of 11-year-old boys from
which this sample was taken.

Here m = 74.71 lb, and s = 10.65 lb. Using s to estimate $\sigma$ and noting that
for the standard normal law the 98% limits are at ±2.326, we find for $\mu$ the
confidence limits $74.71 \pm 2.326(10.65)/(345)^{1/2} = 74.71 \pm 1.33$ lb, or 73.38 to
76.04 lb.
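The calculation of Example 1 can be written as a small function. This is a sketch assuming, as in the example, that the sample is large enough for the normal law to apply and that the sample standard deviation may stand in for the population value; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_limits(m, s, n, confidence):
    """Large-sample confidence limits for the population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 2.326 for 98%
    half_width = z * s / sqrt(n)
    return m - half_width, m + half_width

low, high = mean_confidence_limits(74.71, 10.65, 345, 0.98)
print(round(low, 2), round(high, 2))
```

Changing the last argument to 0.95 or 0.99 gives the corresponding limits without looking up the normal deviate by hand.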

Example 2  The variable X is the lifetime (in days) of test pieces of metal
sheet immersed in tap water, before failure due to corrosion. From a large
number of trials the mean value ($\mu$) of X was found to be 875, with a standard
deviation of 85. For further routine testing, how large should the samples be if
the average life (m) from such a sample is to differ from $\mu$ by not more than
5%, with probability 0.90?

Since 5% of $\mu$ is 43.75, the requirement is that $P(|m - \mu| < 43.75) = 0.90$.
Assuming a normal distribution, the probability 0.90 corresponds to a
standardized variate of 1.645. Therefore, $(43.75)/(\sigma N^{-1/2}) = 1.645$, with $\sigma = 85$. This
gives N = 10.2, so that samples of about 10 items are required (strictly, 11 if
the probability is to be at least 0.90).
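The sample size in Example 2 follows from solving the same relation for N. The sketch below is an illustration under the example's own assumptions (normal distribution of the mean, known standard deviation); the function name is hypothetical, and rounding up to the next whole number is a choice made here so that the stated probability is not undershot.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sigma, tolerance, probability):
    """Value of N for which P(|m - mu| < tolerance) equals the given probability."""
    z = NormalDist().inv_cdf(0.5 + probability / 2)   # about 1.645 for 0.90
    return (z * sigma / tolerance) ** 2

n_exact = required_sample_size(sigma=85, tolerance=0.05 * 875, probability=0.90)
n_needed = ceil(n_exact)
print(round(n_exact, 2), n_needed)
```

The unrounded solution is a little above 10, which is the figure quoted in the example to the nearest whole number.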


Fig. 31    Confidence limits for the parameter
of a binomial distribution (horizontal axis: number of successes)


5.14  Confidence Limits for the Probability of Success in a Binomial Population

If X is the number of successes in N trials, the probability of success in each trial
being $\theta$, we know from § 3.3 that

(5.14.1)  $E(X) = N\theta$

and

(5.14.2)  $V(X) = N\theta(1 - \theta)$

If N is fairly large and $\theta$ not too near 0 or 1, the distribution of X is
approximately normal, particularly if we make the correction for continuity mentioned
in § 3.11. Thus if $\theta_l$, $\theta_u$ are the lower and upper 95% confidence limits for $\theta$,
we have (see Figure 31)

(5.14.3)  $X - \frac{1}{2} - N\theta_l = 1.96[N\theta_l(1 - \theta_l)]^{1/2}$

and

(5.14.4)  $N\theta_u - (X + \frac{1}{2}) = 1.96[N\theta_u(1 - \theta_u)]^{1/2}$

These two equations give $\theta_l$ and $\theta_u$, respectively, as the solution of a quadratic
equation.

If N is quite large, it will often be sufficient to replace $\theta_l$ or $\theta_u$ on the
right-hand side of Eqs. (3) and (4) by the sample proportion X/N, and to ignore the
continuity correction. If we do so, the approximate confidence limits are given by

(5.14.5)  $N\theta_l = X - 1.96 X^{1/2}(1 - X/N)^{1/2}$

and

(5.14.6)  $N\theta_u = X + 1.96 X^{1/2}(1 - X/N)^{1/2}$
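Both the approximate limits and the limits obtained from the quadratic equations can be computed directly. The sketch below is an illustration only: the function names are hypothetical, the quadratic is obtained by squaring the continuity-corrected equations and collecting terms, and the trial figures (280 successes in 400 trials) are those of Example 3.

```python
from math import sqrt

Z = 1.96   # two-sided 95% point of the standard normal law

def approximate_limits(x, n, z=Z):
    """Limits with X/N substituted on the right and no continuity correction."""
    half = z * sqrt(x * (1 - x / n))
    return (x - half) / n, (x + half) / n

def quadratic_limits(x, n, z=Z):
    """Limits from the continuity-corrected equations, squared and solved."""
    def roots(c):
        # (c - n*t)^2 = z^2 * n * t * (1 - t), with c = x - 1/2 or x + 1/2
        a = n * n + z * z * n
        b = -(2 * c * n + z * z * n)
        disc = sqrt(b * b - 4 * a * c * c)
        return (-b - disc) / (2 * a), (-b + disc) / (2 * a)
    lower = roots(x - 0.5)[0]   # smaller root satisfies the unsquared lower equation
    upper = roots(x + 0.5)[1]   # larger root satisfies the unsquared upper equation
    return lower, upper

print([round(t, 3) for t in approximate_limits(280, 400)])
print([round(t, 3) for t in quadratic_limits(280, 400)])
```

Only one root of each squared equation satisfies the original equation, which is why the smaller root is taken for the lower limit and the larger for the upper.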

Example 3  If in 400 binomial trials we find 280 successes, what are the 95%
confidence limits for $\theta$?

(a) The approximate limits given by Eqs. (5) and (6) are

$400\,\theta_l = 280 - 1.96\,[280(0.30)]^{1/2} = 262$

$\theta_l = 0.655$

and

$400\,\theta_u = 280 + 1.96\,[280(0.30)]^{1/2} = 298$

$\theta_u = 0.745$

(b) From Eq. (3), on squaring both sides, we obtain

$(279.5 - 400\,\theta_l)^2 = (3.84)(400)\,\theta_l(1 - \theta_l)$

which, on collecting terms and dividing by the coefficient of $\theta_l^2$, becomes

$\theta_l^2 - 1.3937\,\theta_l + 0.4836 = 0$

The solution of this quadratic gives as the smaller root (the only one that
satisfies the original equation before squaring) $\theta_l = 0.652$. Eq. (4) similarly gives
the quadratic equation

$\theta_u^2 - 1.3987\,\theta_u + 0.4871 = 0$

the larger root of which is 0.744. In this example the approximate method gives
almost as good results as the more exact one.
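
The two calculations of Example 3 can be checked numerically. What follows is a minimal Python sketch and not part of the original text: the function names `binomial_limits_approx` and `binomial_limits_quadratic` are hypothetical, and the multiplier 1.96 is used unrounded, whereas the text rounds its square to 3.84.

```python
import math

def binomial_limits_approx(x, n, z=1.96):
    """Eqs. (5.14.5)-(5.14.6): X/N replaces theta in the standard error,
    and the continuity correction is ignored."""
    half_width = z * math.sqrt(x * (1 - x / n))
    return (x - half_width) / n, (x + half_width) / n

def binomial_limits_quadratic(x, n, z=1.96):
    """Eqs. (5.14.3)-(5.14.4): solve the two quadratics, keeping the
    continuity correction of 1/2."""
    def root(c, lower):
        # (c - n*t)^2 = z^2 * n * t * (1 - t)  rearranges to  a*t^2 - b*t + c^2 = 0
        a = n * n + z * z * n
        b = 2 * c * n + z * z * n
        disc = math.sqrt(b * b - 4 * a * c * c)
        return (b - disc) / (2 * a) if lower else (b + disc) / (2 * a)
    # smaller root of the first quadratic, larger root of the second
    return root(x - 0.5, True), root(x + 0.5, False)

print(binomial_limits_approx(280, 400))     # approximately (0.655, 0.745)
print(binomial_limits_quadratic(280, 400))  # approximately (0.652, 0.744)
```

Only the smaller root is kept for the lower limit and the larger root for the upper limit, since these are the roots that satisfy the equations before squaring.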
5.15  Confidence Limits for the Difference of Probabilities in Two Binomial
Populations  If we are given two fairly large samples which we suspect may
come from two binomial populations with different parameters $\theta_1$ and $\theta_2$, we
can similarly construct confidence limits for the difference of these parameters.
If the confidence interval includes the value zero, the inference will be that the
parameters are not significantly different (at the level of significance determined
by the confidence coefficient).

Let us suppose that d is the difference of the two sample proportions of
successes:

(5.15.1)  $d = p_1 - p_2 = \frac{X_1}{N_1} - \frac{X_2}{N_2}$

Then, by Theorem 1.16 and Bienaymé's Theorem (§ 2.14), we have

(5.15.2)  $E(d) = E(p_1) - E(p_2) = \theta_1 - \theta_2$

(5.15.3)  $V(d) = V(p_1) + V(p_2) = N_1^{-1}\theta_1(1 - \theta_1) + N_2^{-1}\theta_2(1 - \theta_2)$

If the samples are large enough that we may use the normal approximation, d
will also be approximately normal, and

(5.15.4)  $\frac{d - (\theta_1 - \theta_2)}{[N_1^{-1}\theta_1(1 - \theta_1) + N_2^{-1}\theta_2(1 - \theta_2)]^{1/2}} = z$

For the 95% limits we may put $z = \pm 1.96$, and solve for $\theta_1 - \theta_2$. Since
we do not know $\theta_1$ and $\theta_2$ separately, we must replace them in the denominator of
Eq. (4) by their estimators, $p_1$ and $p_2$. The 95% confidence limits are then given
approximately by

(5.15.5)  $\theta_1 - \theta_2 = d \pm 1.96\,[N_1^{-1}p_1(1 - p_1) + N_2^{-1}p_2(1 - p_2)]^{1/2}$

Example 4  A company selling "XX" tires conducted a survey among car
owners in each of two districts, A and B. In district A, 750 persons said they
planned to purchase tires shortly, and 300 of these said they intended to buy the XX
brand. In district B, 600 persons planned to purchase tires, and 210 of these intended to
get XX tires. Does there appear to be a significant difference between districts
A and B with regard to the proportions of prospective XX purchasers?

Here d = 0.40 - 0.35 = 0.05. The approximate standard error of d is

$\left[\frac{(0.40)(0.60)}{750} + \frac{(0.35)(0.65)}{600}\right]^{1/2} = 0.0264$

so that

$\theta_1 - \theta_2 = 0.05 \pm 0.052$
$= -0.002$ to $0.102$

Since the interval includes zero (although only just) we can say that at the 5%
level the observed difference d is not significant. The result is, however, close to
the borderline of significance.
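
The arithmetic of Example 4 follows directly from Eq. (5.15.5). A short Python sketch is given here as an illustration only; the function name `difference_limits` and its return layout are hypothetical choices, not notation from the text.

```python
import math

def difference_limits(x1, n1, x2, n2, z=1.96):
    """Approximate confidence limits for theta_1 - theta_2, Eq. (5.15.5).
    The sample proportions p1, p2 replace the unknown parameters in the variance."""
    p1, p2 = x1 / n1, x2 / n2
    d = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d, se, d - z * se, d + z * se

d, se, lower, upper = difference_limits(300, 750, 210, 600)
print(round(d, 3), round(se, 4), round(lower, 3), round(upper, 3))
```

The interval is judged to include zero when `lower` is negative and `upper` is positive, which is the case for the survey figures above.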

*  5.16  Sampling for Proportion of Successes from a Finite Population  If the
sample of size N is drawn "without replacements" from a finite population of
size M, the distribution of the sample proportion of successes p is not binomial
but hypergeometric (see § 3.5). The expectation and variance of p are given by

(5.16.1)  $E(p) = \theta, \qquad V(p) = \frac{\theta(1 - \theta)}{N} \cdot \frac{M - N}{M - 1}$

For large N (and of course still larger M), the distribution of p is approximately
normal, and confidence limits for $\theta$ may be determined as in § 5.14, with the
appropriate correction for the variance.
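
As a rough illustration of the correction, the sketch below applies the factor $(M - N)/(M - 1)$ to the approximate limits of § 5.14, with the sample proportion substituted for $\theta$ and no continuity correction. The function name `finite_population_limits` and the figures used (280 successes in a sample of 400 from a population of 2000) are hypothetical and do not come from the text.

```python
import math

def finite_population_limits(x, n, m, z=1.96):
    """Approximate limits for theta when n items are drawn without
    replacement from a population of m items (n plays the role of N, m of M)."""
    p = x / n
    correction = (m - n) / (m - 1)          # finite-population factor
    se = math.sqrt(p * (1 - p) / n * correction)
    return p - z * se, p + z * se

lower, upper = finite_population_limits(280, 400, 2000)
print(round(lower, 3), round(upper, 3))
```

Because the factor is less than 1 whenever N exceeds 1, these limits are narrower than the binomial limits for the same sample.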

*  5.17  Use of Binomial Probability Paper  A special graph paper, designed
by Mosteller and Tukey [5], may be used to obtain quick approximate solutions
of estimation problems involving binomial populations. The scheme is based on
Fisher's angular transformation (see § 3.15), $p = \sin^2 A$, which has the effect of
making the variance of A a function of the sample size only (proportional to
1/N) and also of improving the approximation to normality. A specimen of this
graph paper is shown in Figure 32.

The scales of x and y are square-root scales. The horizontal distance of a
point marked x from the origin is proportional to $x^{1/2}$, and similarly for y. A
quarter-circle is drawn through the points marked 100 on each axis, and on this
circle x + y = 100. The angles A, in degrees, are marked on this circle and the
abscissa of a point A is the corresponding p (multiplied by 100). At a distance of
$\sqrt{N}$ from the origin, in a direction given by A, the variance of A on the circle of
radius $\sqrt{N}$ is practically constant, independent both of N and of $\theta$. Any straight
line through the origin passes through points for which y/x is constant, and is
called a split. A 40-60 split, for example, passes through the point x = 40,
y = 60.

Suppose  that  in  a  sample  of  10  we  find  7  "successes,"  and  therefore  3 
"failures."  We  say  that  the  paired  count  for  the  sample  is  (7,3)  and  plot  it  as  a 
right-angled  triangle  with  the  right  angle  at  (7,3)  and  the  sides  each  one  unit  long, 
parallel  to  the  axes.  When  one  of  the  coordinates  is  larger  than  about  100, 
the  one-unit  length  is  scarcely  more  than  the  width  of  a  pencil  line. 

In order to test whether the observed value of p (7/10) is significantly different
from a hypothetical $\theta$ (say 1/2), we measure the perpendicular distance from
the plotted triangle to the 50-50 split. When the numbers x and y are small, there
are two distances, called the short and the long distance, measured from the two
acute angles of the triangle, and these are interpreted by reference to the scale
at the top of the paper (marked Full Scale). A distance of one unit on this scale
corresponds to a standard normal deviate of 1, so that a distance of two units on
the scale represents very nearly the 5% level of significance (when we are interested
in the magnitude of the difference between p and $\theta$ rather than in the sign).
The long and short distances each give a significance level and the observed
result must be regarded as significant at some level in between. In the illustration
above, the two distances are 1.6 and 1.0, so that the observed p is not significantly
different from 1/2 at the 5% level of significance.

Fig. 32  Binomial probability graph paper


Example 5  In an opinion poll 124 "yes" answers were received to a certain
question out of 200 replies. Find 95% confidence limits for the true proportion
of persons who would answer "yes" in the population sampled.

The paired count is (124, 76), and this is plotted as P in Figure 32 (the triangle
is practically a point). Two splits are drawn such that they lie at perpendicular
distances of two scale units from P. These splits cut the quarter-circle at (55, 45)
and (69, 31) so that the 95% confidence limits for $\theta$ are 0.55 and 0.69.
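
The graphical construction of Example 5 has a simple numerical counterpart, sketched below in Python. The sketch assumes that the standard deviation of A, measured in radians, is $1/(2\sqrt{N})$; this is the usual large-sample value for the angular transformation and is consistent with the statement that the variance is proportional to 1/N, but the constant is not quoted in this section. The function name `angular_limits` is hypothetical.

```python
import math

def angular_limits(successes, failures, units=2.0):
    """Limits for theta from the angular transformation p = sin^2(A).
    `units` is the number of standard deviations of A on either side
    (two units, as on the graph paper, is very nearly the 5% level)."""
    n = successes + failures
    angle = math.asin(math.sqrt(successes / n))   # A in radians
    sd = 1.0 / (2.0 * math.sqrt(n))               # assumed s.d. of A
    low_angle = max(angle - units * sd, 0.0)
    high_angle = min(angle + units * sd, math.pi / 2)
    return math.sin(low_angle) ** 2, math.sin(high_angle) ** 2

lower, upper = angular_limits(124, 76)
print(round(lower, 2), round(upper, 2))   # compare the 0.55 and 0.69 read from the paper
```

Working in the angle A rather than in p is what makes the half-width depend on the sample size only.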
5.18  Confidence Limits for the Parameter of a Binomial Distribution with
Small Samples  The normal approximation is not really justified for small
samples, particularly when $\theta$ is not close to 0.5. By the use of cumulative
binomial tables (such as those mentioned in References [3] and [4] of Chapter 3)
it is possible to determine the parameter $\theta_l$ of a binomial such that, for example,
the observed value of X cuts off the upper 2½% tail of the distribution. In the
same way, $\theta_u$ can be found such that the same X cuts off the lower 2½%, X itself
being included in the tail (see Figure 31). These values $\theta_l$ and $\theta_u$ give the 95%
confidence limits for $\theta$.

Thus if N = 20 and X = 6, we find from the tables that B(6, 20, 0.11)
= 0.01755 and B(6, 20, 0.12) = 0.02602. By interpolation, the value of $\theta_l$
corresponding to B(6, 20, $\theta_l$) = 0.025 is about 0.119. Also B(7, 20, 0.55)
= 0.97859 and B(7, 20, 0.54) = 0.97349, giving $\theta_u$, corresponding to 0.975,
as 0.543. The 95% confidence limits for $\theta$ are therefore 0.119 and 0.543.
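
Where tables are not at hand, the same limits can be found by solving the two tail equations numerically. The Python sketch below is an illustration under stated assumptions: it takes B(x, N, $\theta$) to be the probability of x or more successes, which is how the tabulated values above behave, and it replaces table interpolation by bisection. The names `upper_tail`, `solve_increasing` and `exact_limits` are hypothetical.

```python
import math

def upper_tail(x, n, theta):
    """P(X >= x) for a binomial variate with parameters n and theta."""
    return sum(math.comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(x, n + 1))

def solve_increasing(func, target, lo=0.0, hi=1.0, tol=1e-10):
    """Bisection for func(theta) = target, with func increasing in theta."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_limits(x, n, tail=0.025):
    """theta_l: X cuts off the upper tail; theta_u: X cuts off the lower tail."""
    lower = solve_increasing(lambda t: upper_tail(x, n, t), tail) if x > 0 else 0.0
    # P(X <= x) = tail is the same condition as P(X >= x + 1) = 1 - tail
    upper = solve_increasing(lambda t: upper_tail(x + 1, n, t), 1 - tail) if x < n else 1.0
    return lower, upper

print(exact_limits(6, 20))   # compare with 0.119 and 0.543 above
```

Bisection is adequate here because the upper-tail probability increases steadily with $\theta$ for fixed x and N.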

It may be noted for comparison that the approximate method of Eqs.
(5.14.5) and (5.14.6) gives 0.099 and 0.501. The method of Eqs. (5.14.3) and
(5.14.4) gives 0.128 and 0.543, so that even with an N as small as 20 the normal
approximation, with a continuity correction, is fairly satisfactory.

Mention may be made of special tables [6] by Mainland and others, prepared
for the Department of Medical Statistics at New York University College
of Medicine. These give 95% and 99% confidence limits for $\theta$ for a considerable
range of sample sizes and observed proportions, and include all cases that are
likely to arise in practical statistics.


PROBLEMS 

A.  (§§  5.1-5.6) 

1.  The variate X is distributed in a population with density $f(x|\theta) = 2(\theta - x)/\theta^2$,
$0 < x < \theta$. It is desired to estimate $\theta$ from a single observation by using the statistic
T = 2X. Write down the density function for T, integrate to find $F(t|\theta)$, and calculate
the values of $t_1$ and $t_2$ from Eq. (5.3.1) when $\epsilon = 0.05$. Plot the curves $C_\epsilon$ and $C_{1-\epsilon}$,
for $\theta < 1$. If the observed x is 0.02, find 90% confidence limits for $\theta$. Hint: $t_1$ and $t_2$
are given by solutions of quadratic equations for $t_1/\theta$ and $t_2/\theta$. In each case only one
solution is possible, since $t < 2\theta$.

2.  If X is uniformly distributed on an interval of unit length with centre at $x = \theta$,
an estimator for $\theta$ is the mid-range of a sample, that is, the mean of the smallest and the
largest observed values. If T is this estimator for a sample of size 4, the density function
for T is $f(t) = 32(0.5 - |t - \theta|)^3$. If $\theta = 1$, calculate the values of $t_1$ and $t_2$ corresponding
to $\epsilon = 0.025$. Sketch the confidence belt for $\theta$ and find 95% confidence
limits corresponding to an observed t = 1.2. Hint: If $\theta = 1$, 0.5 < t < 1.5. Treat
the cases t < 1 and t > 1 separately.

3.  If X is normally distributed with mean $\mu$ and variance $\sigma^2$, the arithmetic mean
$\bar{X}$ of a sample of size N is normally distributed with mean $\mu$ and variance $\sigma^2/N$. If $\bar{X}$
is used as an estimator of $\mu$, find 99% confidence limits for $\mu$ corresponding to an observed
value of $\bar{X}$. (The variance $\sigma^2$ is assumed to be known.)

4.  Assuming the distribution of the arithmetic mean, as given in Problem 3, show
that $\bar{X}$ is a consistent and unbiased estimator of $\mu$.

5.  The distribution of the mid-range of a sample of size N for a uniform distribution
of X on $(\theta - \tfrac{1}{2}, \theta + \tfrac{1}{2})$ has the density function $f(t) = N 2^{N-1}(\tfrac{1}{2} - |t - \theta|)^{N-1}$.
(Compare Problem A-2, for N = 4.) Show that the mid-range is a consistent and unbiased
estimator of $\theta$. Hint: Prove that $E(T - \theta) = 0$ and $E(T - \theta)^2 = [2(N + 1)(N + 2)]^{-1}$.
Treat separately the integrals from $\theta - \tfrac{1}{2}$ to $\theta$ and from $\theta$ to $\theta + \tfrac{1}{2}$.

6.  If X has a rectangular (uniform) distribution on the interval $(0, \theta)$ and if R is the
range for a sample of size N (the highest value $x_N$ minus the lowest value $x_1$), the
distribution function for R, for a given $\theta$, is $F(r|\theta) = (r/\theta)^N (N\theta/r - N + 1)$. Show
that the statistic $T = R/\theta$ has a distribution independent of $\theta$ and that fiducial limits for
$\theta$ with confidence coefficient $\alpha$ are given by $x_N$ and $R/t_\alpha$, where $\alpha$ is the probability that
$T > t_\alpha$. Hint: $\alpha$ is the probability that $t_\alpha < R/\theta < 1$, i.e., that $1 < \theta/R < t_\alpha^{-1}$. Write
this as a fiducial probability for $\theta$. For given $\alpha$, $t_\alpha$ is the root of an equation of degree N.
Note that $\theta$ must be at least equal to $x_N$.

7.  For the rectangular distribution $f(x) = \theta^{-1}$, $0 < x < \theta$, prove that fiducial
limits for $\theta$ with coefficient $\alpha$, based on a sample of size two with values $x_1$ and $x_2$, are
$x_2$ and $(x_2 - x_1)/(1 - \alpha^{1/2})$. Hint: Use Problem 6 with N = 2.

8.  For the same distribution as in Problem 7, an estimator of $\theta$, based on a sample
of two, is $x_1 + x_2$. The density function is $f(t) = t/\theta^2$, for $t < \theta$, and $f(t) = (2\theta - t)/\theta^2$
for $t > \theta$. Show that confidence limits for $\theta$, with confidence coefficient $\alpha$, are given by
$(x_1 + x_2)/[2 - (1 - \alpha)^{1/2}]$ and $(x_1 + x_2)/(1 - \alpha)^{1/2}$, except that when the lower limit
is below $x_2$ it must be replaced by $x_2$.

Work out numerical values if $x_1 = 3$, $x_2 = 5$, and $\alpha = 0.9$. Compare these limits
with those given by Problem 7 for the same data.

9.  (a) A sample of N objects is taken from a large binomial population in which a
proportion $\theta$ of the objects possess a certain attribute A. If p is the proportion of
objects possessing this attribute in the sample, show that $pq/(N - 1)$ is an unbiased
estimator of $\theta(1 - \theta)/N$, where $q = 1 - p$.

(b) Suppose that the sample is selected, one item at a time, until m of the selected
items are A's. Calculate the probability that the size of the sample is N, and show that
$(m - 1)/(N - 1)$ is an unbiased estimator of $\theta$. Hint: Find the probability that in the
first N - 1 items there are m - 1 A's and that the Nth item is an A. The distribution
of N - m is negative binomial.

B.  (§§5.7-5.11) 

1.  Write out the proof of the statement in Eq. (5.7.8), that E((pq » = [x!vq.

2.  In the following table, X represents the number of defective items produced
by a machine in one day's operation, and f is the frequency of occurrence of X over a
period of 200 days. Compute the first four k-statistics for this empirical distribution,
which is roughly Poisson. (Note that X is discrete. There is no occasion to use either
an auxiliary variable or Sheppard's corrections.)

    X              f
    0            102
    1             59
    2             31
    3              8
    4 or more      0
    Total        200

3.  Find the standard errors of $k_1$ and $k_2$ and the estimated covariance of $k_1$ and $k_2$
for the data of Problem 2.

4.  Find the standard errors of $k_1$ and $k_2$ as calculated for the u variable, for the
data on weights used in §§ 5.9 and 5.10. (Use the corrected values of $k_2$ and $k_4$.)

C.  (§§  5.12-5.16) 

1.  A normal population has mean 20 and standard deviation 2. A sample of six
items from the population has a mean 18.2. Can the sample be reasonably regarded
as a random one, using the 5% level of significance? Hint: Calculate the probability
that a random sample would have a mean differing from 20 by as much as 1.8.
Alternatively, find 95% confidence limits for $\mu$ and see whether these include the
true value 20.

2.  A normal population of times has a standard deviation 0.104 sec. A random
sample of 12 items from the population has a mean 12.33 sec. Calculate 90%
confidence limits for the population mean. What is the smallest sample size we should
use if we want to be 95% sure that the sample mean will not differ from the (unknown)
population mean by more than 0.05 sec.?

3.  A  group  of  120  freshmen  in  arts  at  a  large  university  take  an  achievement  test 
in  mathematics  and  obtain  a  mean  score  70  with  a  standard  deviation  14.  Another 
group  of  80  in  engineering  take  the  same  test  and  obtain  a  mean  score  75  with  a  standard 
deviation  12.  Is  the  difference  in  the  means  significant  at  the  1%  level?  Hint:  The 
samples  are  large  enough  for  the  populations  to  be  regarded  as  normal.  The  variance 
of  the  difference  of  the  means  is  the  sum  of  the  variances  of  the  two  means  separately. 
(Compare  §  5.15.)  As  an  estimate  of  the  population  variance  take  the  weighted  mean 
of  the  sample  variances,  weighted  according  to  the  sample  size  less  one.  Assume  both 
populations  of  freshmen  are  large  compared  with  the  sample  sizes. 

4.  If 400 eggs are selected at random from a large consignment and 50 are found
to be bad, what are the approximate 99% confidence limits for the proportion of bad
eggs in the whole consignment? Calculate also the more exact confidence limits for
comparison.

5.  A physician treats 20 patients suffering from a certain disease and 11 of them die.
The mortality rate for this disease, based on thousands of cases, is 42%. Is the physician's
sample significantly different from the population, at the 5% level? Hint: Calculate
the probability that $X \ge 11$, assuming normality.

6.  In  order  to  test  the  efficacy  of  a  drug  said  to  prevent  sea-sickness,  25  men  who 
had  always  developed  symptoms  of  sickness  when  subjected  to  the  motion  of  a  rocking 
machine  were  given  the  drug.  On  a  further  trial  with  the  machine,  15  of  these  men  were 
found  to  be  immune  to  the  motion.  Find  95%  confidence  limits  for  the  proportion 
of  men  liable  to  seasickness  who  would  be  rendered  immune  by  taking  this  drug. 
(Assume  approximate  normality.) 

7.  In  a  poll  of  148  men  and  152  women  the  question  was  asked,  "Do  you  approve  of 
the  practice  of  tipping,  by  and  large?",  and  89  of  the  men  and  116  of  the  women 
answered  "yes."  Construct  approximate  95%  confidence  limits  for  the  difference 
between  the  proportion  of  "yes"  answers  in  the  male  population  sampled  and  that  in 
the  female  population  sampled.  (Both  populations  may  be  taken  as  large  compared 
with  the  samples.) 

8.  Random  samples  of  50  students  each  were  taken  from  (a)  a  freshman  class  in 
arts  and  science  numbering  248  and  (b)  a  freshman  class  in  engineering  numbering  187. 
Both  sample  groups  were  given  a  mathematical  aptitude  test,  and  the  numbers  reaching 
a  pass  standard  were  35  and  41  respectively.  Test  the  hypothesis  that  the  proportion 
of  passes  would  be  the  same  in  both  classes  if  all  members  were  tested.  Hint:  The  two 
populations  are  finite.  Use  Eq.  (5.16.1).  Calculate  the  probability  of  a  difference  as 
great  numerically  as  that  observed  if  the  stated  hypothesis  were  true. 

9.  A research worker wishes to estimate the mean of a population using a random
sample so large that the probability will be at least 0.95 that the sample mean will not
differ from the population mean by more than 25% of the population standard
deviation. How large should the sample be?

10.  Obtain  some  binomial  probability  paper  and  solve  Problems  C-5  and  C-6 
graphically. 
Chapter 6

ESTIMATION, TESTING AND DECISION MAKING

6.1  Maximum Likelihood Point Estimation  A very useful method of estimation,
which has been vigorously promoted by Fisher, is the method of maximum
likelihood. The general idea is to choose as estimator of a parameter $\theta$
that function of the sample observations which will, when substituted for $\theta$,
make the probability of the sample a maximum. In other words, for this value
of $\theta$ the observed sample is also the most likely sample.

Consider, for instance, a binomial population with parameter $\theta$. The
variate observed is the number of successes X in N trials, and the probability
that X = x is

$f(x|\theta) = \binom{N}{x}\theta^x(1 - \theta)^{N-x}$

As a function of $\theta$, this is a maximum when $\partial f/\partial\theta = 0$ and $\partial^2 f/\partial\theta^2 < 0$.
Since

$\partial f/\partial\theta = \binom{N}{x}\left[x\theta^{x-1}(1 - \theta)^{N-x} - (N - x)\theta^x(1 - \theta)^{N-x-1}\right]$

$= f(x|\theta)\left[x\theta^{-1} - (N - x)(1 - \theta)^{-1}\right]$

the critical value $\hat\theta$ of $\theta$ is given by

$x\hat\theta^{-1} - (N - x)(1 - \hat\theta)^{-1} = 0$

or

$\hat\theta = \frac{x}{N}$

It is easy to verify that this value does indeed correspond to a maximum for f
and not a minimum. The maximum likelihood estimator is therefore identical
with the unbiased estimator used for $\theta$ in Chapter 5.
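
The result x/N can be confirmed by brute force. The Python sketch below is illustrative only: it evaluates the binomial probability on a grid of trial values and reports the value at which it is greatest. The function names and the data (7 successes in 10 trials) are hypothetical.

```python
import math

def binomial_likelihood(theta, x, n):
    """f(x | theta) for x successes in n trials."""
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

def grid_maximizer(x, n, steps=10_000):
    """Trial value of theta, on a grid of spacing 1/steps, that maximizes f."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda t: binomial_likelihood(t, x, n))

print(grid_maximizer(7, 10))   # the grid point nearest to x/N
```

A grid search is used here only because it makes no appeal to the derivative; it agrees with the critical value found analytically.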

If the continuous variate X has a probability density $f(x|\theta, \theta_a)$ which
depends on a parameter $\theta$ and possibly on other parameters represented jointly
by $\theta_a$, the likelihood of a set of sample values $x_1, x_2, \ldots, x_N$ is defined by

(6.1.1)  $L = f(x_1|\theta, \theta_a) \cdot f(x_2|\theta, \theta_a) \cdots f(x_N|\theta, \theta_a)$

The likelihood is therefore a joint probability density for the whole sample, but
not a probability. When X is discrete, L is just the joint probability for the
observed sample (all the items being supposed independently selected from the
same population). The principle of maximum likelihood states that we should
choose as an estimator of the parameter $\theta$ that statistic T (if it exists) which
maximizes L for variations in $\theta$, whatever the values of the other parameters $\theta_a$
may be.

In practice, the logarithm of L is usually more convenient than L itself.
Since log L is a monotone-increasing function of L, a value of $\theta$ which maximizes
L also maximizes log L. The maximum likelihood estimator is therefore given
by solving for $\theta$ the equations

(6.1.2)  $\frac{\partial}{\partial\theta}(\log L) = 0, \qquad \frac{\partial^2}{\partial\theta^2}(\log L) < 0$


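Example 1  For a normal population, with parameters $\mu$, $\sigma$, the density
function is

$f(x|\mu, \sigma) = (2\pi\sigma^2)^{-1/2}\exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$

Therefore, for a sample of size N with values $x_1, x_2, \ldots, x_N$,

$L = (2\pi\sigma^2)^{-N/2}\exp\left[-\sum_i \frac{(x_i - \mu)^2}{2\sigma^2}\right]$

and

(6.1.3)  $\log L = -\frac{N}{2}\log(2\pi) - N\log\sigma - \sum_i \frac{(x_i - \mu)^2}{2\sigma^2}$

Differentiating with respect to $\mu$, we find

$\frac{\partial(\log L)}{\partial\mu} = \sum_i \frac{x_i - \mu}{\sigma^2}$

and

$\frac{\partial^2(\log L)}{\partial\mu^2} = -\frac{N}{\sigma^2}$

The maximum likelihood estimator for $\mu$ is therefore $\hat\mu$, given by

$\sum_i (x_i - \hat\mu) = 0$

or

$\hat\mu = \frac{1}{N}\sum_i x_i = m$

the sample mean. This result is independent of the value of the other parameter $\sigma$.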
If, however, we try to find an estimator for $\sigma$ in the same way, we get

$\frac{\partial(\log L)}{\partial\sigma} = -\frac{N}{\sigma} + \sum_i \frac{(x_i - \mu)^2}{\sigma^3}$

and, on putting this equal to zero, the value of $\hat\sigma$ is not independent of $\mu$. The
extraneous parameter in such a case is often called a nuisance parameter. There
is no maximum likelihood estimator for $\sigma$ by itself, but it is possible to get joint
maximum likelihood estimators for $\mu$ and $\sigma$ by solving together the two equations,

$\frac{\partial}{\partial\mu}(\log L) = 0$  and  $\frac{\partial}{\partial\sigma}(\log L) = 0$

These give $\hat\mu = m$, $\hat\sigma^2 = \frac{1}{N}\sum_i (x_i - m)^2$, so that the joint maximum likelihood
estimators for $\mu$ and $\sigma^2$ are the sample mean (m) and the sample second moment
($m_2$). It may be noted that, although the former is unbiased, the latter is not,
since, as we have already found,

$E(m_2) = (N - 1)\sigma^2/N$

and not $\sigma^2$ itself.
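
The joint estimators are easily computed. The following Python sketch is an illustration with made-up data; the name `normal_mle` and the six sample values are hypothetical. It also shows the factor N/(N - 1) that converts the sample second moment into the unbiased estimator of $\sigma^2$.

```python
def normal_mle(sample):
    """Joint maximum likelihood estimates for a normal population:
    the sample mean m and the sample second moment m2 (divisor N)."""
    n = len(sample)
    m = sum(sample) / n
    m2 = sum((x - m) ** 2 for x in sample) / n
    return m, m2

sample = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6]          # hypothetical observations
m, m2 = normal_mle(sample)
# E(m2) = (N - 1) * sigma^2 / N, so rescaling removes the bias
unbiased = m2 * len(sample) / (len(sample) - 1)
print(m, m2, unbiased)
```

For small N the difference between the two variance estimates is appreciable; here the divisor changes from 6 to 5.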


6.2  Sufficient Estimators  Some characteristics of estimators were mentioned
in § 5.6, but there are others which are also important. A statistic T is
said to be a sufficient estimator of $\theta$, or, in Fisher's terminology, exhaustive, if
it uses all the relevant information in the sample. If the likelihood function is
expressed in the form

(6.2.1)  $L = g(t|\theta)\,h(x_1, x_2, \ldots, x_N | t, \theta)$

or

$\log L = \log g + \log h$

where g is the density function for T and h is the conditional density function for
$x_1, \ldots, x_N$, given that T = t, then it may happen that h does not depend on $\theta$.
If so, T is a sufficient estimator of $\theta$.

For suppose U is another statistic obtainable from the observations. The
distribution of U for a given T will depend upon h, but since h does not involve
$\theta$, the statistic U can provide no information about $\theta$ which is not already given
by T.

It is desirable to have a sufficient estimator where possible, since then we
know that we are utilizing all the information about $\theta$ that we can get from the
sample, but sufficiency alone does not define a statistic very precisely. If T is
sufficient, so is a function of T.

Sufficient statistics exist in only a relatively few special cases. It is one of the
merits of the maximum likelihood method of estimation that if a sufficient
statistic does exist for a parameter the maximum likelihood estimator is sufficient.
In Example 1 above, the sum $\sum_i (x_i - \mu)^2$ occurs in the expression for log L
and this can be split up into a part depending on m and $\mu$ and a part independent
of $\mu$. Thus

$\sum_i (x_i - \mu)^2 = \sum_i [(x_i - m) + (m - \mu)]^2$

$= \sum_i (x_i - m)^2 + N(m - \mu)^2 + 2(m - \mu)\sum_i (x_i - m)$

$= N m_2 + N(m - \mu)^2$

since $\sum_i (x_i - m) = 0$ from the definition of m. Equation (6.1.3) can therefore
be written

$\log L = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{N m_2}{2\sigma^2} - \frac{N(m - \mu)^2}{2\sigma^2}$

so that, apart from constants,

$\log g = -\frac{N}{2\sigma^2}(m - \mu)^2, \qquad \log h = -\frac{N m_2}{2\sigma^2}$

It is clear that h is a function of the sample values not depending on $\mu$, while g
is a function of the estimator m and of $\mu$. Therefore, m is a sufficient estimator
for $\mu$.
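
The splitting of log L can be verified numerically for any sample. The Python sketch below, with hypothetical function names and arbitrary trial values of $\mu$ and $\sigma$, computes log L once directly from Eq. (6.1.3) and once as constants plus log g plus log h.

```python
import math

def log_likelihood(sample, mu, sigma):
    """Eq. (6.1.3), computed directly from the observations."""
    n = len(sample)
    ss = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * math.log(2 * math.pi) - n * math.log(sigma) - ss / (2 * sigma**2)

def log_likelihood_split(sample, mu, sigma):
    """The same quantity written as constants + log g + log h."""
    n = len(sample)
    m = sum(sample) / n
    m2 = sum((x - m) ** 2 for x in sample) / n
    log_g = -n * (m - mu) ** 2 / (2 * sigma**2)   # involves mu only through m
    log_h = -n * m2 / (2 * sigma**2)              # free of mu
    const = -0.5 * n * math.log(2 * math.pi) - n * math.log(sigma)
    return const + log_g + log_h

data = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6]             # hypothetical observations
print(log_likelihood(data, 4.5, 1.2), log_likelihood_split(data, 4.5, 1.2))
```

Since `log_h` never sees $\mu$, two samples with the same mean give likelihoods whose ratio does not depend on $\mu$, which is the practical content of sufficiency.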

6.3  Properties of Maximum Likelihood Estimators  The following five
properties are the main reasons for recommending the use of maximum likelihood
(m.l.) estimators:

(a) The m.l. estimator is consistent. If $f(x|\theta)$ is continuous in x, and also
continuous and monotonic in $\theta$ over an interval including the true value $\theta_0$, and
if T is the m.l. estimator of $\theta$, then T converges in probability to $\theta_0$ as the sample
size increases. The proof holds also for a discrete variate if we replace each
value by an interval over which we suppose the frequency distributed uniformly.
Details may be found in [1].

(b) The m.l. estimator tends to normality as N increases. The conditions in
(a) are supposed to hold, together with some further conditions on the continuity
of $\partial f/\partial\theta$.

(c) The m.l. estimator is most-efficient (see § 5.6). The variance of the m.l.
estimator T is given by

(6.3.1)  $[V(T)]^{-1} = E\left[\left(\frac{\partial \log L}{\partial\theta}\right)^2\right]$

where, on the right-hand side, $\theta$ is to be put equal to $\theta_0$. If the domain of f does
not depend on $\theta$, Eq. (1) is equivalent to

(6.3.2)  $[V(T)]^{-1} = -E\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)$

Since $\log L = \sum_i \log f(x_i)$, and since the $x_i$ all have the same distribution,
Eq. (2) may be written

(6.3.3)  $[V(T)]^{-1} = -N\,E\left(\frac{\partial^2 \log f}{\partial\theta^2}\right) = -N\int \frac{\partial^2 \log f}{\partial\theta^2}\,f\,dx$

This is a convenient way of finding the variance of a m.l. estimator.

It can be shown that if U is any estimator for $\theta$, which in large samples is
normally distributed about $\theta_0$ with variance V(U), and if the domain of f does
not depend on $\theta$, then

(6.3.4)  $[V(U)]^{-1} \le N\,E\left[\left(\frac{\partial \log f}{\partial\theta}\right)^2\right]$

It follows from Eq. (1) and property (b) above that the m.l. estimator is most
efficient in the sense described in § 5.6.

(d) If a sufficient estimator exists for $\theta$, the m.l. estimator is sufficient.
For, by § 6.2, if T is sufficient and $g(t|\theta)$ is its density function,

(6.3.5)  $L = g(t|\theta) \cdot h(x_1, x_2, \ldots, x_N | t)$

where h is a function of the sample values which does not depend on $\theta$. Therefore,

(6.3.6)  $\frac{\partial(\log L)}{\partial\theta} = \frac{\partial(\log g)}{\partial\theta} = \psi(\theta, t)$

The m.l. estimator is given by putting $\psi(\theta, t) = 0$ and solving for $\theta$. The
result is obviously a function of t, say $\phi(t)$. The estimator is therefore $\phi(T)$, and
since T is sufficient, so is $\phi(T)$.

(e) The m.l. estimator is invariant under functional transformations. This
means that if T is the m.l. estimator of $\theta$, and if $u(\theta)$ is a function of $\theta$, then u(T)
is the m.l. estimator for $u(\theta)$.

If, for example, we are dealing with a normal population (for which $\mu_4 = 3\sigma^4$)
and we know that the m.l. estimator for $\sigma^2$ is the sample second moment
$m_2$ (that is, $N^{-1}\sum_i (x_i - m)^2$), we conclude that the m.l. estimator for $\mu_4$ is
$3m_2^2$ and not $m_4$. Of course, $m_4$ might be used as an estimator, but it would not
have as small a sampling variance as $3m_2^2$.

This property of invariance is not true of all estimators. If T is an unbiased
estimator of $\theta$, for example, it does not follow that $T^2$ is an unbiased estimator
of $\theta^2$.

Example 2  If the population is normal, and if $\sigma$ is supposed known, the
m.l. estimator of $\mu$ is the sample mean m, as shown in Example 1.

Since $\log f = -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(x - \mu)^2$, we have

$\frac{\partial \log f}{\partial\mu} = \frac{x - \mu}{\sigma^2}$

$\frac{\partial^2 \log f}{\partial\mu^2} = -\frac{1}{\sigma^2}$

$-\int_{-\infty}^{\infty} \left(\frac{\partial^2 \log f}{\partial\mu^2}\right) f\,dx = \frac{1}{\sigma^2}$

The variance of m is therefore $\sigma^2/N$, in agreement with the result found in
Chapter 5.

*  6.4  The Cramér-Rao Inequality  If T is an estimator of $\theta$, and if its bias is b,
as defined by the relation

(6.4.1)  $E(T) = \theta + b$

where b may depend upon $\theta$, then

(6.4.2)  $E[(T - \theta)^2] \ge \frac{\left(1 + \frac{\partial b}{\partial\theta}\right)^2}{N\,E\left[\left(\frac{\partial \log f}{\partial\theta}\right)^2\right]}$

If $g(t|\theta)$ is the frequency function for T, the equality sign in relation (2) holds
if, and only if,

(6.4.3)  (a) T is sufficient
         (b) $\frac{\partial(\log g)}{\partial\theta} = k(T - \theta)$

where k is independent of T but may depend on $\theta$.

If T is an unbiased estimator, so that b = 0, $E[(T - \theta)^2]$ becomes the variance
of T, and Eq. (2) is

(6.4.4)  $V(T) \ge \frac{1}{N\,E\left[\left(\frac{\partial \log f}{\partial\theta}\right)^2\right]}$

which is formally the same as (6.3.4), although the conditions imposed on the
estimator are different.

The relation (4) was proved independently by Cramér and Rao, although it
was found earlier by Fisher for the special case of a normal population. For a
proof see [2].

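Example 3  For the two-parameter gamma distribution of Eq. (4.4.6),

$f(x) = e^{-x/\beta} x^{\alpha - 1} [\beta^{\alpha} \Gamma(\alpha)]^{-1}, \qquad 0 < x < \infty$

Suppose that $\alpha$ is known and we wish to find an estimator for $\beta$. We have

$\log f = -\dfrac{x}{\beta} + (\alpha - 1)\log x - \alpha \log \beta - \log \Gamma(\alpha)$

so that

$\log L = -\sum \dfrac{x_i}{\beta} - N\alpha \log \beta + \text{terms independent of } \beta.$

Therefore,

$\dfrac{\partial(\log L)}{\partial \beta} = \dfrac{\sum x_i}{\beta^2} - \dfrac{N\alpha}{\beta} = 0$

giving as the m.l. estimator

$\hat\beta = \dfrac{\sum x_i}{N\alpha} = \dfrac{m}{\alpha}$

where $m$ is the sample mean.

Since, for this distribution, $E(x) = \alpha\beta$, it follows that $E(\hat\beta) = \beta$, and the
estimator $\hat\beta$ is therefore unbiased. The variance is given by

$[V(\hat\beta)]^{-1} = -N \displaystyle\int_0^{\infty} \dfrac{\partial^2(\log f)}{\partial \beta^2}\, f\, dx
 = -N \displaystyle\int_0^{\infty} \left(-\dfrac{2x}{\beta^3} + \dfrac{\alpha}{\beta^2}\right) f\, dx
 = \dfrac{2N\alpha}{\beta^2} - \dfrac{N\alpha}{\beta^2} = \dfrac{N\alpha}{\beta^2}$

The variance of $\hat\beta$ is therefore $\beta^2/N\alpha$. The likelihood function may be written

$\log L = -\dfrac{N\alpha}{\beta}\hat\beta - N\alpha \log \beta + (\alpha - 1)\sum \log x_i - N \log \Gamma(\alpha) = \log g + \log h$

where

$\log g = -(N\alpha\hat\beta)/\beta - N\alpha \log \beta + \text{terms independent of } \beta$

and $\log h$ is independent of $\beta$. This shows that $\hat\beta$ is sufficient. Also,

$\dfrac{\partial(\log g)}{\partial \beta} = \dfrac{N\alpha\hat\beta}{\beta^2} - \dfrac{N\alpha}{\beta} = \dfrac{N\alpha}{\beta^2}(\hat\beta - \beta)$

which is of the form of Eq. (6.4.3), condition (b). The two conditions for the
sign of equality in (6.4.4) are therefore satisfied.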
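
The conclusions of Example 3 can be checked numerically. The following is only a rough simulation sketch, not part of the original argument: the parameter values, the number of replicates, the seed, and the function names are all arbitrary choices made here for illustration. It relies on the fact that Python's `random.gammavariate(alpha, beta)` uses the same shape and scale form as the density above.

```python
import random
from statistics import fmean

def beta_hat(sample, alpha):
    """M.l. estimator of the scale beta when the shape alpha is known: m / alpha."""
    return fmean(sample) / alpha

def simulate(alpha=2.0, beta=3.0, n=10, replicates=20000, seed=1):
    """Return the empirical mean and variance of beta_hat over many samples of size n."""
    rng = random.Random(seed)
    estimates = [
        beta_hat([rng.gammavariate(alpha, beta) for _ in range(n)], alpha)
        for _ in range(replicates)
    ]
    mean_est = fmean(estimates)
    var_est = fmean([(e - mean_est) ** 2 for e in estimates])
    return mean_est, var_est

mean_est, var_est = simulate()
bound = 3.0 ** 2 / (10 * 2.0)  # beta^2 / (N * alpha), the variance found above
print(mean_est, var_est, bound)
```

If the derivation is right, the empirical mean should lie close to $\beta$ and the empirical variance close to $\beta^2/N\alpha$, up to simulation error.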

*  6.5  Approximate Calculation of a Maximum Likelihood Estimator  It
happens sometimes that the method of maximum likelihood leads to equations
which are very troublesome to solve. In such a case it may be useful to find a
simpler, but less efficient, estimator and apply a correction to bring it nearer
to the desired form.

If $T$ is the m.l. estimator and $U$ is an estimator which is not quite as efficient
as $T$, we know that $(\partial \log L/\partial\theta)_T = 0$, and $E(\partial^2 \log L/\partial\theta^2)_T = -[V(T)]^{-1}$, at
least for large $N$.

By Taylor's theorem,

$\left(\dfrac{\partial \log L}{\partial\theta}\right)_U = \left(\dfrac{\partial \log L}{\partial\theta}\right)_T + (U - T)\left(\dfrac{\partial^2 \log L}{\partial\theta^2}\right)_T + \text{terms of higher order.}$

Since we suppose that the quantity $U - T$ is small and since we can approximately
replace $(\partial^2 \log L/\partial\theta^2)_T$ by its expected value, we have, on neglecting the
higher terms,

(6.5.1)  $T \approx U + V(T)\left(\dfrac{\partial \log L}{\partial\theta}\right)_U$

The last term on the right-hand side is a correction to be applied to $U$ to bring
it nearer to $T$. The value of $V(T)$ is obtained from Eq. (6.3.2).


Example 4  For the Cauchy distribution

$f(x) = \dfrac{1}{\pi} \cdot \dfrac{1}{1 + (x - \theta)^2}, \qquad -\infty < x < \infty$

the sample mean is not a good estimator of $\theta$, since it is no better than a single
observation. The sample median may be used and in large samples has a variance
$\pi^2/4N$. The m.l. estimator is given by

$\log L = -\sum \log[1 + (x_i - \theta)^2] - N \log \pi,$

$\dfrac{\partial(\log L)}{\partial\theta} = 2\sum \dfrac{x_i - \theta}{1 + (x_i - \theta)^2} = 0$

As an equation in $\theta$, this gives a polynomial of degree $2N - 1$, which even for
fairly small $N$ is difficult to solve. From Eq. (6.3.2),

$[V(T)]^{-1} = -\dfrac{2N}{\pi} \displaystyle\int_{-\infty}^{\infty} \dfrac{(x - \theta)^2 - 1}{[1 + (x - \theta)^2]^3}\, dx
 = -\dfrac{4N}{\pi} \displaystyle\int_0^{\infty} \dfrac{u^2 - 1}{(1 + u^2)^3}\, du
 = \dfrac{N}{2}$

so that $V(T) = 2/N$.

The efficiency of the median is therefore $(2/N) \div (\pi^2/4N) = 8/\pi^2$, or about 0.8.
The improved estimator will be the median $U$ plus the correcting term

$V(T)\left(\dfrac{\partial \log L}{\partial\theta}\right)_U = \dfrac{4}{N} \sum \dfrac{x_i - U}{1 + (x_i - U)^2}$
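
The correction of Example 4 is simple to compute. The sketch below is a minimal illustration under the assumptions of that example (Cauchy location model, $V(T) = 2/N$); the function names are chosen here and do not come from the text.

```python
from statistics import median

def score(xs, theta):
    """d(log L)/d(theta) for the Cauchy location model."""
    return 2.0 * sum((x - theta) / (1.0 + (x - theta) ** 2) for x in xs)

def one_step_estimate(xs):
    """Median U plus the correction V(T) * score(U), with V(T) = 2/N as in Eq. (6.5.1)."""
    u = median(xs)
    return u + (2.0 / len(xs)) * score(xs, u)

# A sample symmetric about 5 needs no correction: the score at the median is zero.
print(one_step_estimate([2, 4, 5, 6, 8]))
# An asymmetric sample is pulled away from the median.
print(one_step_estimate([0, 1, 5]))
```

The step could be repeated, using the corrected value in place of $U$, but the text applies it only once.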

6.6  Tests of Hypotheses  There is a large class of statistical problems concerned
with testing whether or not some hypothesis is true. For example, a
machine is turning out thread, and we would like to be reasonably sure that the
breaking strength is at least 100 lb weight; if not, the machine may have to be
re-adjusted. We can take samples of the thread at intervals and test them, but
because of the variability of the product (inherent in the process of manufacture)
the samples will vary among themselves. We can, however, use them to test the
hypothesis ($H_0$) that the mean breaking strength ($\mu$) of the thread produced
is at least 100 lb wt against the alternative hypothesis ($H_1$) that $\mu$ is less than
100 lb wt.

The  hypothesis  which  we  set  up  and  proceed  to  test  by  experiment  is  called  a 
null  hypothesis.  On  the  basis  of  the  sample  we  can  take  various  possible  actions. 
We can (1) reject the null hypothesis, which in this example may mean dismantling
the machine, (2) accept the null hypothesis, which means that we
happily  accept  the  product  of  the  machine  as  up  to  standard,  (3)  declare  that 
further  experimentation  is  necessary  before  we  can  make  a  decision.  If  the  size 
of  the  sample  is  fixed  beforehand,  this  third  procedure  is  not  open  to  us,  but 
in  sequential  sampling,  as  we  shall  see  later,  tests  are  continued  until  we  feel 
justified  in  taking  either  action  (1)  or  action  (2). 

In  taking  such  an  action  on  the  basis  of  a  sample  we  run  a  risk  of  doing  the 
wrong  thing.  Obviously  we  may  commit  either  of  two  kinds  of  error :  we  may 
reject  a  hypothesis  which  is  really  true  (this  will  be  called  a  rejection  error,  or 
an  error  of  the  first  kind),  or  we  may  accept  a  hypothesis  which  is  really  false 
(this  will  be  called  an  acceptance  error,  or  an  error  of  the  second  kind). 

Tests are usually made by computing some statistic (e.g., an arithmetic mean
or a variance) from the observations and noting whether or not this computed
value lies in some particular interval, or set of intervals, previously chosen on the
axis of real numbers. The part of the real axis so chosen is called the region of
rejection, and the hypothesis $H_0$ is rejected if the computed value lies in this
region. Thus, if the population is known to be normally distributed about a
value $\mu$ with unit variance, and if $H_0$ is the hypothesis that $\mu$ is zero, we shall be
inclined to reject this hypothesis if a sample of $N$ observations gives a mean too
far from zero. If $H_0$ were true, the sample mean would depart from zero by as
much as $1.96N^{-1/2}$ in only 5% of random samples of size $N$. By taking as our
region of rejection that part of the real axis outside the bounds $\pm 1.96N^{-1/2}$, we
run a risk of wrongly rejecting $H_0$, but the chance of doing so is only 0.05. By
suitably choosing the region of rejection we can make this chance what we like,
depending on the circumstances of the problem and the consequences of making
a wrong decision.
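
As a quick check of the figure quoted above, the chance of wrongly rejecting $H_0$ can be computed directly from the normal distribution of the sample mean. This is a small sketch using Python's standard `statistics.NormalDist`; the function names and the sample sizes tried are arbitrary choices made here.

```python
from math import sqrt
from statistics import NormalDist

def rejection_bound(n, z=1.96):
    """Half-width of the acceptance interval for the mean of n unit-variance observations."""
    return z / sqrt(n)

def size_of_test(n, bound):
    """P(|m| > bound) when mu = 0, where m is normal with variance 1/n."""
    sampling = NormalDist(mu=0.0, sigma=1.0 / sqrt(n))
    return sampling.cdf(-bound) + (1.0 - sampling.cdf(bound))

for n in (4, 25, 100):
    b = rejection_bound(n)
    print(n, round(b, 4), round(size_of_test(n, b), 4))
```

The bound shrinks as $N$ grows, while the chance of a wrong rejection stays at about 0.05 for every $N$.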




In  some  types  of  problem  the  test  statistic  may  be  a  pair  of  numbers,  and 
then  the  region  of  rejection  may  be  represented  by  a  part  of  the  x-y  plane.  The 
concept  can  obviously  be  extended  to  three  or  more  dimensions.  However,  in 
most of the day-to-day problems of practical statistics the statistic used is
one-dimensional and the region of rejection is an interval or pair of intervals on
the  real  axis. 

6.7  Simple and Composite Hypotheses  A hypothesis which is equivalent
to a complete specification of the distribution is said to be simple. Otherwise, it
is composite. Thus, if a population is known to be normal and to have variance
$\sigma^2$, the hypothesis that the mean is $\mu_0$ is a simple one, since the mean and
variance together specify a normal distribution completely. The alternative
hypothesis could be simple also, if it were known, for instance, that the population
mean must be either $\mu_0$ or $\mu_1$ and could not have any other value. More
usually, the alternative would be composite. It could be a two-sided alternative
(namely, that $\mu$ is either less than or greater than $\mu_0$), or it could be a one-sided
alternative (that $\mu > \mu_0$, for instance, supposing that we have good reason to
believe that it cannot possibly be less). We may, for example, want to know
whether a new kind of fertilizer, applied in a particular way, will increase the
yield of a crop, but feel quite certain that it will not actually diminish the yield.
It would be reasonable in this case to use a one-sided alternative.


Fig.  33    Errors  of  the  first  and  second  kind 


6.8  The Size and Power of a Test  Suppose we want to test the simple null
hypothesis ($H_0$), that $\theta = \theta_0$, against the simple alternative hypothesis ($H_1$), that
$\theta = \theta_1$, by means of a test statistic $T$. In order to fix our region of rejection we
shall need to know the density function of $T$ when $\theta = \theta_0$, say $g(t|\theta_0)$. In
general, $g$ will depend on the sample size $N$, and the region of rejection ($R$) will
be an interval on the $t$-axis (e.g., the interval $t > t_\alpha$ in Figure 33).

If the probability that $T$ falls in $R$, when $H_0$ is true, is $\alpha$, then $\alpha$ is the probability
of committing a rejection error (that is, of rejecting $H_0$ when it is true).
This probability is often called the size of the test. It is given by

(6.8.1)  $\alpha = \displaystyle\int_{(R)} g(t|\theta_0)\, dt$

and is represented by the heavily-shaded area in Figure 33. In practice, $R$ is
usually chosen so that $\alpha$ is 0.05 or 0.01, although of course any convenient size
can be used.

If the statistic $T$ is discrete, the integral will be replaced by a sum, and it will
not generally be possible to make the size exactly 0.05 or other preassigned value.
The region of rejection will be a set of values of $T$, the probabilities of which
add up to something near the required $\alpha$.




Fig.  34    Power  function  of  an  ideal  test 


If the alternative hypothesis $H_1$ is true, there will be a different distribution
of $T$, with density $g(t|\theta_1)$. The probability of committing an acceptance error
(that is, of accepting $H_0$ if $H_1$ is true) is given by

(6.8.2)  $\beta = \displaystyle\int_{(A)} g(t|\theta_1)\, dt$

where $A$ is the region of acceptance (all possible values of $t$ outside of $R$). This
probability is represented by the lightly-shaded area in Figure 33. The power of
the test is defined by

(6.8.3)  $P = 1 - \beta = \displaystyle\int_{(R)} g(t|\theta_1)\, dt$

It is the probability of rejecting $H_0$ if $H_1$ is true (that is, if $H_0$ should be rejected).
Obviously, we would like a test to be as powerful as possible for the same size.
The power depends, of course, on $\theta_1$. If $\theta$ is any value of $\theta_1$, $P(\theta)$ is called the
power function of the test for $\theta_0$ against $\theta$. If $\theta$ is near to $\theta_0$, the power will
usually be small, and if $\theta = \theta_0$ it becomes equal to $\alpha$. For $\theta$ far removed from $\theta_0$
the power will usually be near to 1, since any reasonable test should be able to
decide between very different hypotheses. The ideal power function would be
something like the one sketched in Figure 34, in which $\alpha = 0$ and $P(\theta) = 1$ for
all $\theta$ not equal to $\theta_0$. Both kinds of error would then be zero, but such ideal
tests are not available in practice.

An actual power function will be more like that depicted in Fig. 35. A
low power can be tolerated for $\theta$ near $\theta_0$, since no great harm is done if we do
make the mistake of accepting $\theta_0$ instead of $\theta$. Where $\theta$ differs considerably from
$\theta_0$, so that it might be a serious matter to mistake one for the other, the power
is near 1 and the acceptance error $\beta$ is small.

The function $\beta(\theta) = 1 - P(\theta)$ is often used instead of the power function,
particularly in industrial practice, and is called the operating characteristic
(O.C.) of the test. The graph of the O.C. is like that of the power function
turned upside-down, with 0 and 1 interchanged.


Fig.  35    Power  functions  of  alternative  tests 


The two types of error that we have defined depend on conditional probabilities,
the probability that $T$ falls in $R$ when $H_0$ is true (denoted by $T \in R|H_0$)
or the probability that $T$ falls in $A$ when $H_1$ is true (denoted by $T \in A|H_1$). If we
assert, on the basis of our observations, that $H_0$ is not true, the chance that we
are wrong depends not only on these conditional probabilities but also on the
prior probability (previous to our observations) that $H_0$ actually is true. If
$p_0$ is the prior probability of $H_0$, it follows from the rules for probability calculations
in Chapter 1 that the chance of being wrong in rejecting $H_0$ is given by

$P(T \in R|H_0) \cdot p_0 = \alpha p_0$

and the chance of being wrong in accepting $H_0$ is

$P(T \in A|H_1) \cdot (1 - p_0) = \beta(1 - p_0)$

where $A$ is the region of acceptance (the whole domain of $T$ outside of $R$). If
the null hypothesis $H_0$ that we choose to test is one that has a small prior
probability of being true, the chance of being wrong in rejecting it may be much
less than the size of the test $\alpha$.

It is often possible to choose the region of rejection $R$ in different ways, even
though the size $\alpha$ remains constant. Each choice of $R$ will give rise to a different
power function. Figure 35 illustrates the possibility that the test using a region
$R'$ may be more powerful than the test using $R$ for all values of $\theta_1 > \theta_0$, but may
be less powerful when $\theta_1 < \theta_0$.

If for every value of $\theta$, except $\theta_0$, the power curve of $R'$ is below that of $R$, the
test using $R$ is said to be uniformly more powerful than that using $R'$. If this
holds for all possible choices of $R'$ then the test using $R$ is a uniformly most
powerful test. In a few cases such tests have been found.

6.9  The Neyman-Pearson Theorem  Let $X$ be a random variable with density
function $f(x)$. (If $X$ is discrete, the necessary modifications in the proof can
easily be made.) We suppose that $f(x)$ depends on a parameter $\theta$ for which we
would like to test the simple hypothesis $H_0$ (that $\theta = \theta_0$) against the simple
hypothesis $H_1$ (that $\theta = \theta_1$). The test consists of rejecting $H_0$ if the observed
value of $X$ lies in a region $R$ and accepting $H_0$ otherwise. The size of the test is

(6.9.1)  $\alpha = \displaystyle\int_{(R)} f(x|\theta_0)\, dx$

and the power is

(6.9.2)  $P = \displaystyle\int_{(R)} f(x|\theta_1)\, dx$

Suppose now that $R'$ is any other region of the domain of $X$ for which

(6.9.3)  $\displaystyle\int_{(R')} f(x|\theta_0)\, dx \le \alpha$

If for every such $R'$ it is true that

(6.9.4)  $\displaystyle\int_{(R')} f(x|\theta_1)\, dx \le \int_{(R)} f(x|\theta_1)\, dx$

then $R$ is a most-powerful test, of size not greater than $\alpha$, for testing $\theta_0$ against
$\theta_1$. Neyman and Pearson [3] proved that if a region $R$ exists satisfying (1) and
such that $x$ belongs to $R$ whenever

(6.9.5)  $\dfrac{f(x|\theta_0)}{f(x|\theta_1)} < c$

where $c$ is some constant, and does not belong to $R$ whenever this ratio $> c$,
then $R$ is a most-powerful test of size not greater than $\alpha$. The ratio in (5) is called
the likelihood ratio, and will be denoted by $L(x)$.

As well as merely distinguishing between two fixed values $\theta_0$ and $\theta_1$, the likelihood
ratio test applies more generally. Thus, suppose the possible values of
$\theta$ form a set $\Omega$ (which may, for example, be the interval 0 to 1, or the interval
$-\infty$ to $\infty$). The null hypothesis $H_0$ may specify that $\theta$ belongs to some subset
$\omega$ of $\Omega$ (for instance, the single value 0.5, or the interval from 0.4 to 0.6) and $H_1$
is then the hypothesis that $\theta$ belongs to $\Omega - \omega$. The likelihood ratio is defined as
the ratio of the maximum likelihood under $H_0$ to the maximum likelihood under
$H_1$:

(6.9.6)  $L(x) = \dfrac{\max_{\theta \in \omega} f(x|\theta)}{\max_{\theta \in \Omega - \omega} f(x|\theta)}$

If $L(x)$ is small, the observed $x$ will be more likely under $H_1$ than under
$H_0$, so that it would be unreasonable to maintain $H_0$. The test consists in
rejecting $H_0$ when $L(x) < c$, $c$ being such that

(6.9.7)  $P(L(x) < c|H_0) = \alpha$

Many useful tests in statistics are likelihood ratio tests. The statistic $X$
may consist of a set of $N$ independent observations, forming a random sample,
and the likelihood will then be a joint probability density for the $N$ observations.
When $N$ is large, the distribution of $-2 \log L(X)$ under $H_0$ is approximately a
chi-square distribution with degrees of freedom depending on the number of
parameters concerned (one, in the case discussed above). This was shown by
Wilks [4]. If $H_1$ is true, the distribution of $-2 \log L(X)$ is approximately
non-central chi-square (see Appendix A.13). Tables of the non-central chi-square
distribution may be found in [5].

When the parent population is normal, as we shall see in the next section, the
chi-square distribution of $-2 \log L(X)$ holds exactly, even for $N = 1$.

6.10  The One-Sided Normal Test  Suppose the population is normal, with
known variance $\sigma^2$, and suppose we wish to distinguish between two possible
values of the mean, $\mu_0$ and $\mu_1$, where $\mu_1 > \mu_0$.

The test statistic is the sample mean $m$ computed from $N$ observations. Since
$m$ is normally distributed with mean $\mu$ ($\mu$ is either $\mu_0$ or $\mu_1$) and variance $\sigma^2/N$,

(6.10.1)  $f(m|\mu) = \left(\dfrac{N}{2\pi\sigma^2}\right)^{1/2} e^{-N(m - \mu)^2/2\sigma^2}$

The likelihood ratio is

(6.10.2)  $L(m) = \dfrac{f(m|\mu_0)}{f(m|\mu_1)} = \exp\left\{\dfrac{N}{2\sigma^2}\left[(m - \mu_1)^2 - (m - \mu_0)^2\right]\right\}
 = \exp\left\{\dfrac{N}{2\sigma^2}(\mu_1 - \mu_0)(\mu_1 + \mu_0 - 2m)\right\}$

This will be less than some positive constant $c$ if $\mu_1 + \mu_0 - 2m < c_1$, where
$c_1$ is another constant depending on $c$ and on the known quantities $\mu_1$, $\mu_0$, $\sigma^2$
and $N$. Actually $c_1 = \dfrac{2\sigma^2 \log c}{N(\mu_1 - \mu_0)}$. The relation $\mu_1 + \mu_0 - 2m < c_1$
implies

$m > c_2, \qquad c_2 = \dfrac{\mu_1 + \mu_0 - c_1}{2}$

so that the test reduces to rejecting $H_0$ (that $\mu = \mu_0$) if the sample mean is
greater than a certain value, $c_2$. This value can be determined by deciding on the
size of the test.

If $\alpha$ is the probability that $m > c_2$, given that $\mu = \mu_0$,

(6.10.3)  $\alpha = \displaystyle\int_{c_2}^{\infty} f(m|\mu_0)\, dm$

By the transformation $(m - \mu_0)N^{1/2}/\sigma = v$, this becomes

(6.10.4)  $\alpha = (2\pi)^{-1/2} \displaystyle\int_{v_0}^{\infty} e^{-v^2/2}\, dv = 1 - \Phi(v_0)$

where $v_0 = (c_2 - \mu_0)N^{1/2}/\sigma$ and $\Phi(v)$ is the cumulative distribution function for
the normal law.

If we choose $\alpha = 0.05$, we find from the tables of the normal law that
$v_0 = 1.645$, so that

(6.10.5)  $c_2 = \mu_0 + 1.645\sigma N^{-1/2}$

The test therefore consists in rejecting $H_0$ in favor of $H_1$ if $m > \mu_0 + 1.645\sigma N^{-1/2}$.
The power of this test is given by

(6.10.6)  $P = \displaystyle\int_{c_2}^{\infty} f(m|\mu_1)\, dm = 1 - \Phi(v_1)$

where $v_1 = (c_2 - \mu_1)N^{1/2}/\sigma = 1.645 - (\mu_1 - \mu_0)N^{1/2}/\sigma$. Thus, if $\mu_1 - \mu_0 = 0.3\sigma$
and $N = 9$, the power is $1 - \Phi(0.745) = 0.228$. With a sample of size 9
there is therefore a probability equal to 0.228 of detecting a difference $\mu_1 - \mu_0$
as great as 0.3 of the standard deviation, if it is known that this difference is
positive. This is a one-sided test.
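
The critical value (6.10.5) and the power (6.10.6) can be reproduced in a few lines. The sketch below uses Python's standard `statistics.NormalDist` in place of the printed tables; the function names are illustrative, and the example is standardized to $\mu_0 = 0$, $\sigma = 1$ so that $\mu_1 = 0.3$ corresponds to $\mu_1 - \mu_0 = 0.3\sigma$.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal law, Phi(v) = STD.cdf(v)

def critical_value(mu0, sigma, n, alpha=0.05):
    """c2 of Eq. (6.10.5): reject H0 when the sample mean exceeds this value."""
    return mu0 + STD.inv_cdf(1.0 - alpha) * sigma / sqrt(n)

def power(mu1, mu0, sigma, n, alpha=0.05):
    """P of Eq. (6.10.6): probability that m > c2 when the true mean is mu1."""
    c2 = critical_value(mu0, sigma, n, alpha)
    v1 = (c2 - mu1) * sqrt(n) / sigma
    return 1.0 - STD.cdf(v1)

print(round(critical_value(mu0=0.0, sigma=1.0, n=9), 4))
print(round(power(mu1=0.3, mu0=0.0, sigma=1.0, n=9), 3))
```

Calling `power` with `mu1` equal to `mu0` returns the size $\alpha$, as the discussion of the power function in § 6.8 requires.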

It may be noted that this test can be regarded as a test of the simple hypothesis
$H_0$ against the composite alternative $H_1$ (that $\mu > \mu_0$). The power is then a
function of $\mu$. The set $\Omega$ of possible values of $\mu$ is the set of real numbers $\ge \mu_0$,
and the set $\omega$ consists of the single number $\mu_0$.

The likelihood ratio is

(6.10.7)  $L(m) = \dfrac{f(m|\mu_0)}{\max_{\mu \in \Omega - \omega} f(m|\mu)}$

where $f(m|\mu)$ is given by Eq. (1). This density, $f(m|\mu)$, is a maximum for
variations in $\mu$ when the exponential factor is equal to 1, that is, when $\mu = m$.
If the sample mean $m$ should happen to be less than $\mu_0$, the maximum would be
when $\mu$ is arbitrarily near to $\mu_0$. Therefore, $L(m) = 1$ if $m < \mu_0$, but if $m > \mu_0$,

(6.10.8)  $L(m) = e^{-N(m - \mu_0)^2/2\sigma^2}$

This is less than $c$ if $m - \mu_0 > c_1$, and therefore if $m > c_2$. The number $c_2$ is
determined by

(6.10.9)  $\alpha = \displaystyle\int_{c_2}^{\infty} f(m|\mu_0)\, dm = 1 - \Phi(v_0)$

with $v_0 = (c_2 - \mu_0)N^{1/2}/\sigma$.

The test consists in rejecting $H_0$ if $m > c_2$. If $m$ should be less than $\mu_0$, $H_0$
will naturally be accepted. The power is given by

(6.10.10)  $P(\mu) = \displaystyle\int_{c_2}^{\infty} f(m|\mu)\, dm = 1 - \Phi(v)$

with $v = (c_2 - \mu)N^{1/2}/\sigma$. Since $-2 \log L(m)$, from Eq. (8), is equal to
$N(m - \mu_0)^2/\sigma^2$, which on hypothesis $H_0$ is the square of a standard normal variate,
it follows that in this example $-2 \log L(m)$ has a chi-square distribution with
one degree of freedom, regardless of the value of $N$.

6.11  The Two-Sided Normal Test  If the null hypothesis $H_0$ is that $\mu = \mu_0$
(for a normal parent population with variance $\sigma^2$), and the two-sided alternative
$H_1$ is that either $\mu > \mu_0$ or $\mu < \mu_0$, we shall have, for a test based on the sample
mean,

(6.11.1)  $L(m) = e^{-N(m - \mu_0)^2/2\sigma^2}$

which is less than $c$ if $|m - \mu_0| > c_1$.

The region of rejection therefore consists of two parts, from $-\infty$ to
$\mu_0 - c_1$ and from $\mu_0 + c_1$ to $\infty$. For a given $\alpha$, $c_1$ is fixed by the relation

(6.11.2)  $\displaystyle\int_{-\infty}^{\mu_0 - c_1} f(m|\mu_0)\, dm + \int_{\mu_0 + c_1}^{\infty} f(m|\mu_0)\, dm = \alpha$

Because of the symmetry of the normal distribution, both integrals are equal to
$\alpha/2$, and as in § 6.10 we find

(6.11.3)  $\alpha/2 = 1 - \Phi(v), \qquad v = c_1 N^{1/2}/\sigma$

For $\alpha = 0.05$, $c_1 = 1.96\sigma N^{-1/2}$. The power is given by

(6.11.4)  $P(\mu) = 1 - \displaystyle\int_{\mu_0 - c_1}^{\mu_0 + c_1} f(m|\mu)\, dm = 1 - \Phi(v_1) + \Phi(v_0)$

where

$v_1 = \dfrac{N^{1/2}}{\sigma}(\mu_0 + c_1 - \mu) = 1.96 - N^{1/2}\sigma^{-1}(\mu - \mu_0)$

and

$v_0 = \dfrac{N^{1/2}}{\sigma}(\mu_0 - c_1 - \mu) = -1.96 - N^{1/2}\sigma^{-1}(\mu - \mu_0)$

Example 5  What size of sample is necessary in order to detect with probability
0.8 a difference between the population mean and the assumed value $\mu_0$
amounting to as little as $0.2\sigma$, given that the probability of rejection error (of
stating that a difference exists when in fact it does not) is 0.05?

Here $c_1 = 1.96\sigma N^{-1/2}$,

$P(\mu) = 1 - \Phi(v_1) + \Phi(v_0) = 0.8$

with

$v_1 = 1.96 \pm 0.2N^{1/2}$

$v_0 = -1.96 \pm 0.2N^{1/2}$

(The two plus signs, or the two minus signs, go together.) For fairly large $N$
(taking the plus signs), $\Phi(v_1) \approx 1$, $\Phi(v_0) \approx 0.8$, so that $v_0 \approx 0.8416$, giving
$N = 196$. With this value, $v_1 = 4.76$, and $\Phi(v_1)$ is certainly close enough to 1.
A sample size of 196 will therefore give the required power. The same result
follows if we use the minus signs, with $\Phi(v_0) \approx 0$, $\Phi(v_1) \approx 0.2$, and $v_1 \approx -0.8416$.
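
The calculation of Example 5 can be sketched as follows. The code evaluates the power function (6.11.4) in terms of the standardized difference $\delta = (\mu - \mu_0)/\sigma$ and applies the same large-$N$ approximation as the text; the symbol $\delta$ and the function names are introduced here only for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal law

def two_sided_power(delta, n, alpha=0.05):
    """P(mu) of Eq. (6.11.4), with delta = (mu - mu0) / sigma."""
    z = STD.inv_cdf(1.0 - alpha / 2.0)      # 1.96 for alpha = 0.05
    v1 = z - sqrt(n) * delta
    v0 = -z - sqrt(n) * delta
    return 1.0 - STD.cdf(v1) + STD.cdf(v0)

def approx_sample_size(delta, target_power, alpha=0.05):
    """Large-N approximation of Example 5, neglecting the far tail of the region."""
    z_alpha = STD.inv_cdf(1.0 - alpha / 2.0)
    z_power = STD.inv_cdf(target_power)     # 0.8416 for a power of 0.8
    return ((z_alpha + z_power) / abs(delta)) ** 2

n_exact = approx_sample_size(0.2, 0.8)
print(n_exact, round(n_exact))
print(two_sided_power(0.2, 196), two_sided_power(-0.2, 196))
```

The unrounded solution lies slightly above 196, so the power at $N = 196$ falls just short of 0.8; the text rounds to 196, and a strictly conservative choice would round up instead.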

*  6.12  The Randomized Neyman-Pearson Theorem  It is possible to increase
the power of a test, in certain circumstances, by allowing a randomized decision.
The total domain of the statistic $X$ is divided into three parts: $R$, $A$, and $D$. If
the observed $x$ falls in $R$, $H_0$ is rejected, and if it falls in $A$, $H_0$ is accepted, but
there is also a doubtful region $D$. If $x$ falls in $D$ we toss a coin or draw a card or
consult a table of random numbers; that is, we employ some randomizing
procedure which gives us a known probability of rejecting $H_0$.

We can define a test function $\psi(x)$ by letting $\psi(x) = 1$ if $x \in R$, $\psi(x) = 0$ if
$x \in A$ and $\psi(x) = \psi_0$ if $x \in D$, $\psi(x)$ being in all cases the probability of rejection
of $H_0$ and $\psi_0$ being a number between 0 and 1.

The randomized Neyman-Pearson theorem states that if $L(x)$ is defined as in
(6.9.5), and if

(6.12.1)  $\psi(x) = 1$ when $L(x) < c$
          $\psi(x) = 0$ when $L(x) > c$
          $\psi(x) = \psi_0$ when $L(x) = c$

then the test with test-function $\psi(x)$ is most powerful of size $\alpha$ for testing $H_0$
against $H_1$. The value of $c$ is determined by

(6.12.2)  $P(L(X) < c|H_0) \le \alpha$

and the value of $\psi_0$ by

(6.12.3)  $P(L(X) < c|H_0) + \psi_0 P(L(X) = c|H_0) = \alpha$

If the region $D$ includes only a single value of $x$, $\psi_0$ is uniquely determined, but
in other cases several values of $\psi_0$ may be found to satisfy this equation.

Example 6  Suppose we want to test the hypothesis that the proportion of
defectives in a large lot of manufactured articles is not more than 10%, and we
decide to do so by taking a sample of four items and noting the number ($X$) of
defectives. Clearly $X$ can take only the values 0, 1, 2, 3 or 4, and the larger $X$ is,
the more readily we shall reject the hypothesis.

The probability of exactly $x$ defectives is

(6.12.4)  $f(x|\theta) = \dbinom{4}{x}\theta^x(1 - \theta)^{4 - x}$

If $\theta = 0.10$, this expression, for $x = 2, 3, 4$, takes values 0.0486, 0.0036 and
0.0001, respectively. If $\theta < 0.10$, these values will be still smaller. We might
therefore take as the region of rejection the set of values $x = 3$ and $x = 4$, and
the rejection error will be

(6.12.5)  $\displaystyle\sum_{(R)} f(x|\theta) \le 0.0037, \qquad \theta \le 0.10$

If, however, we include $x = 2$ in $R$, we have

$\displaystyle\sum_{(R)} f(x|\theta) \le 0.0523, \qquad \theta \le 0.10$

and, at least for some values of $\theta$, the size of the test will be greater than 0.05.
The non-randomized test would therefore tell us to reject $H_0$ if $X = 3$ or 4, and
accept $H_0$ if $X = 0$, 1 or 2. The power of this test is

(6.12.6)  $P(\theta) = \displaystyle\sum_{x=3}^{4} f(x|\theta) = \theta^4 + 4\theta^3(1 - \theta)$

For $\theta = 0.20$, this is 0.027.

Suppose now we use a randomized test, and decide to reject $H_0$ with probability
$\psi_0$ when $X = 2$. The probability of this under $H_0$ is 0.0486, so that
Eq. (3) gives, for $\alpha = 0.05$,

(6.12.7)  $0.0037 + 0.0486\,\psi_0 = 0.05$

Therefore, $\psi_0 = 0.95$. The randomized test is:

     reject $H_0$ if $X = 3$ or 4
     accept $H_0$ if $X = 0$ or 1
     reject $H_0$ with probability 0.95 if $X = 2$

A way of rejecting $H_0$ with probability 0.95 would be to use a table of random
two-digit numbers. Before opening the table, decide arbitrarily on a particular
page, a particular column, and a particular position in the column (say seventh
from the top). Then look up the number. If it lies between 00 and 94, inclusively,
reject $H_0$.

The power of this randomized test is

(6.12.8)  $P(\theta) = \displaystyle\sum_{x=3}^{4} f(x|\theta) + 0.95 f(2|\theta) = \theta^4 + 4\theta^3(1 - \theta) + 5.70\,\theta^2(1 - \theta)^2$

For $\theta = 0.20$ this is 0.173, greater than for the non-randomized test.

In the above example we did not need to find the likelihood ratio, but we
can easily do so. The maximum of Eq. (4) under $H_0$ is $\binom{4}{x}(0.10)^x(0.90)^{4-x}$ if
$x > 0$, or 1 if $x = 0$, and the maximum under $H_1$ is $\binom{4}{x}\left(\frac{x}{4}\right)^x\left(1 - \frac{x}{4}\right)^{4-x}$ (given
by $\theta = x/4$) if $x > 0$, or $(0.90)^4$ when $x = 0$. Therefore $L(x) = (10/9)^4$ when
$x = 0$ and $L(x) = \left(\dfrac{0.40}{x}\right)^x\left(\dfrac{3.60}{4 - x}\right)^{4-x}$ when $x > 0$. For $x = 2$, this is 0.1296,
which is the $c$ of Eqs. (2) and (3). The probability that $L(X) = c$ is the same as
the probability that $X = 2$.
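
The arithmetic of Example 6 can be reproduced with a short script. It is a sketch of this particular example only (four items, $\theta_0 = 0.10$, rejection for $X = 3$ or 4 and randomization at $X = 2$); the function names are chosen here for illustration.

```python
from math import comb

def f(x, theta, n=4):
    """Binomial probability of exactly x defectives, Eq. (6.12.4)."""
    return comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)

def power(theta, psi0=0.0):
    """Reject for X = 3 or 4, and with probability psi0 when X = 2."""
    return f(3, theta) + f(4, theta) + psi0 * f(2, theta)

def psi0_for_size(alpha, theta0=0.10):
    """Solve Eq. (6.12.3) for psi0 at the boundary value theta0."""
    return (alpha - power(theta0)) / f(2, theta0)

def likelihood_ratio(x, theta0=0.10, n=4):
    """L(x) for H0: theta <= theta0 against H1: theta > theta0.

    Valid when x / n > theta0 for every x >= 1, as in this example.
    """
    if x == 0:
        return 1.0 / (1.0 - theta0) ** n
    return f(x, theta0, n) / f(x, x / n, n)

print(round(psi0_for_size(0.05), 4))       # the text rounds this to 0.95
print(round(power(0.20), 4))               # non-randomized test
print(round(power(0.20, psi0=0.95), 4))    # randomized test
print([round(likelihood_ratio(x), 4) for x in range(5)])
```

The likelihood ratio decreases as $x$ increases, which is why rejecting for small $L(x)$ amounts to rejecting for large $X$.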

Example 7  Suppose the null hypothesis $H_0$ is that $X$ is a random variable
with a rectangular distribution of mean 2 and range 2, and the alternative $H_1$ is
that $X$ has a rectangular distribution of mean 4 and range 4. It is clear from
Figure 36 that $H_0$ must be accepted when $1 < x < 2$ and rejected when
$3 < x < 6$. The only doubt arises where the two distributions overlap, for
$2 < x < 3$ (the region $D$). Evidently, $L(x) = \infty, 2, 0$ for the regions $A$, $D$ and
$R$. If in the region $D$ the probability of rejection is $\psi_0$, we have

$\alpha = P(3 < x < 6|H_0) + \psi_0 P(2 < x < 3|H_0) = 0 + \psi_0(1/2)$

If $\alpha = 0.05$, $\psi_0 = 0.10$. The power of the test is

$P = P(3 < x < 6|H_1) + \psi_0 P(2 < x < 3|H_1) = \tfrac{3}{4} + 0.1\left(\tfrac{1}{4}\right) = \tfrac{31}{40}$

Fig. 36  Randomized test

The value of $\psi_0$ is here not unique. If we take $\psi_0(x) = (x - 2)/5$, $2 < x < 3$, this
will give the same size and power as $\psi_0 = 0.10$. The error $\alpha$ is now determined by

$\alpha = P(3 < x < 6|H_0) + \displaystyle\int_2^3 \psi_0(x) f(x|H_0)\, dx = \int_2^3 \dfrac{x - 2}{10}\, dx = \dfrac{1}{20}$
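
The size and power in Example 7 involve only lengths of intervals, so they can be verified with exact fractions. This is a small sketch; the helper functions and the representation of each rectangular distribution by its end points are choices made here for illustration.

```python
from fractions import Fraction as F

def overlap(a, b, lo, hi):
    """Length of the intersection of the intervals (a, b) and (lo, hi)."""
    return max(F(0), min(b, hi) - max(a, lo))

def prob(a, b, lo, hi):
    """P(a < X < b) for X rectangular on (lo, hi)."""
    return overlap(a, b, lo, hi) / (hi - lo)

H0 = (F(1), F(3))  # mean 2, range 2
H1 = (F(2), F(6))  # mean 4, range 4

def size(psi0):
    """Rejection error: certain rejection on (3, 6), probability psi0 on (2, 3)."""
    return prob(3, 6, *H0) + psi0 * prob(2, 3, *H0)

def power(psi0):
    """Probability of rejecting H0 when H1 is true."""
    return prob(3, 6, *H1) + psi0 * prob(2, 3, *H1)

psi0 = F(5, 100) / prob(2, 3, *H0)   # solve size(psi0) = 0.05
print(psi0, size(psi0), power(psi0))
```

Because both densities are constant on $D$, any $\psi_0(x)$ whose average over $2 < x < 3$ equals 1/10, such as $(x - 2)/5$, gives the same size and power.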

6.13  Statistical Decisions and Risk  The practical problem of the statistician
is usually that of making a decision in the face of uncertainty. The problem may
be one of deciding on the best value to use for some characteristic of a population,
such as the variance, or it may involve deciding between alternative
hypotheses. The decision may require action, for example, accepting or rejecting
a lot of manufactured articles after having inspected a random sample, or
recommending the use of a particular fertilizer for increasing the expected yield
of a crop, after analysing some experimental results. We should like to have a
sound guiding principle for use in making such decisions, but we must not expect
too much from any single principle. The circumstances of a particular problem
will be all-important.

There  are  two  general  decision  principles  which  have  been  quite  widely 
used,  one  associated  with  the  names  of  Bayes  and  Laplace,  the  other  due  to 
Abraham  Wald,  although  these  are  by  no  means  the  only  possibilities.  The 
Bayes  rule  is  to  choose  that  course  of  action  which  has  the  largest  expectation  of 
gain  (or,  which  comes  to  the  same  thing,  the  smallest  expectation  of  loss).  This 
rule  assumes  that  we  know,  or  can  estimate,  the  prior  probabilities  of  the 
various  possible  situations  with  which  we  may  be  faced.  The  Wald,  or  minimax, 
principle  is  to  choose  that  action  which  minimizes  the  maximum  loss  that  could 
occur  in  the  worst  possible  case.  This  is  evidently  a  rather  pessimistic  attitude, 
but  it  does  minimize  the  risk  of  a  disastrous  loss. 
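
The difference between the two principles can be seen in a toy calculation. Everything numerical below is hypothetical: the losses, the prior probabilities, and the names of the actions and states are invented here purely to illustrate the two rules, and are not taken from the text.

```python
# Hypothetical losses in arbitrary units: LOSS[action][state].
LOSS = {
    "accept lot": {"good lot": 0.0, "poor lot": 10.0},
    "reject lot": {"good lot": 4.0, "poor lot": 0.0},
}
# Assumed prior probabilities of the two possible situations.
PRIOR = {"good lot": 0.9, "poor lot": 0.1}

def expected_loss(action):
    """Expectation of loss for an action under the assumed prior."""
    return sum(PRIOR[state] * loss for state, loss in LOSS[action].items())

def bayes_action():
    """Bayes rule: the action with the smallest expectation of loss."""
    return min(LOSS, key=expected_loss)

def minimax_action():
    """Minimax rule: the action whose worst-case loss is smallest."""
    return min(LOSS, key=lambda action: max(LOSS[action].values()))

print(bayes_action(), expected_loss("accept lot"), expected_loss("reject lot"))
print(minimax_action())
```

With these invented figures the Bayes rule accepts the lot, because poor lots are assumed rare, while the minimax rule rejects it, because accepting carries the larger worst-case loss.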

Both  principles  require  the  person  making  the  decision  to  give  numerical 
values  to  the  gains,  or  losses,  which  will  ensue  from  the  various  possible  actions. 
Sometimes  this  is  a  fairly  straightforward  matter  of  cost  accounting,  and  the 
values can be given in dollars and cents. If the problem is concerned with accepting
or rejecting a lot of manufactured articles, on the basis of some sampling
scheme,  it  will  generally  be  possible  to  estimate  fairly  accurately  the  costs  of 
sampling  and  inspection  of  individual  items,  and  also  the  losses  involved  in 
accepting  a  poor  lot  or  rejecting  a  good  one.  If,  however,  the  problem  is  to  decide 
between  alternative  medical  treatments  of  a  disease,  the  error  of  saying  that  a 
proposed  new  treatment  is  no  better  than  the  old  one,  when  in  fact  it  is  better, 
may  cost  lives  which  might  have  been  saved  had  the  new  treatment  been  adopted. 
Even  when  the  gain  is  monetary,  it  may  be  argued  that  its  value  is  different  in 
different  circumstances.  A  sum  of  $50  does  not  look  the  same  to  a  millionaire 
and  to  a  hobo.  Economists  have  attempted  to  make  a  scale  of  "utility"  to 
measure satisfactions and preferences, and where costs, or losses and gains, are
mentioned later in this book the units may, if desired, be understood as units of
utility.

The  risk  of  making  a  wrong  decision  may  often  be  very  greatly  reduced  by 
taking  a  large  number  of  observations,  but  sampling  and  experimentation  cost 
money,  or  at  least  time  and  effort,  and  this  should  be  reckoned  in  the  total 
accounting.  Wald  therefore  introduced  a  risk  function  which  depends  partly  on 
the  decision  made  or  the  action  taken,  and  partly  on  the  cost  of  experimenting 
so  as  to  have  a  basis  for  decision. 

Suppose we base our decision to take one of k possible actions $a_1, a_2, \ldots, a_k$ on
a single sample of N observations of a variate X. Let the cost be $c_N$, a bounded,
non-negative number depending on N and possibly on the actual set of observations.
(If all observations cost the same, $c_N$ is proportional to N.) The
probability of action $a_i$ will depend on the decision rule d which is used, and
may be denoted by $p(a_i \mid d)$, and d of course depends on the set of observations
$x_1, \ldots, x_N$. There will be a joint likelihood function $f(x_1, \ldots, x_N)$ for any given
set of values of X, and this function will generally depend on one or more
parameters, the values of which represent the unknown state of Nature. For
convenience we will suppose that there is only one parameter, $\theta$, which can take
a set of values symbolised by $\Omega$.

If the statistician takes action $a_i$ when $\theta$ is really equal to $\theta_j$, we can suppose
that his loss is $L(a_i, \theta_j)$. Wald regarded this loss as always non-negative, and
equal to zero when the best possible decision in the circumstances is made. Any
other decision involves a positive loss. The expected loss for the given set of
observations is

(6.13.1)  $E[L(a_i, \theta_j)] = \sum_{i=1}^{k} L(a_i, \theta_j)\, p(a_i \mid d)$

and  the  expected  loss,  whatever  the  sample  observations  may  turn  out  to  be,  is 

(6.13.2)  $r_1(\theta_j) = \sum_{i=1}^{k} \int L(a_i, \theta_j)\, p(a_i \mid d)\, f(x)\, dx$

where $f(x)\, dx$ is written for $f(x_1, \ldots, x_N)\, dx_1 \ldots dx_N$ and the integral is over the
whole N-dimensional sample space.

The  expected  cost  of  the  observations  will  be 


(6.13.3)  $r_2(\theta_j) = \int c_N\, f(x)\, dx$

since this cost will not depend on the subsequent action $a_i$. The risk function is
the sum of $r_1$ and $r_2$, namely,

(6.13.4)  $r(\theta_j) = r_1(\theta_j) + r_2(\theta_j)$

Example 8  A zoologist wants to estimate the average number ($\mu$) of a
particular organism per unit volume in the water of a lake. He takes a sample
of volume v and counts the number of such organisms (X) in this volume. He
estimates $\mu$ by the ratio $x/v$ ($= m$), where x is the observed value of X.
The distribution of X may be assumed to be Poisson, so that

(6.13.5)  $P(X = x \mid \mu) = \dfrac{(\mu v)^x e^{-\mu v}}{x!}$

If m is equal to $\mu$, the estimate is correct and there is presumably no loss.
The loss will depend on the size of the error. It can hardly depend on the first
power of the error, since if it did there would be a negative loss (a gain) when the
error was in one direction. The simplest thing is to suppose that the loss is
proportional to $(m - \mu)^2$. The loss due to estimating $\mu$ as m is, therefore,

$L(m, \mu) = k(m - \mu)^2$

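The expected loss, whatever the observed x, is

(6.13.6)  $r_1(\mu) = \sum_{x=0}^{\infty} k\left(\dfrac{x}{v} - \mu\right)^2 \dfrac{e^{-\mu v}(\mu v)^x}{x!} = \dfrac{k}{v^2}\,\mu v = \dfrac{k\mu}{v}$

since this is a discrete distribution and the integral therefore becomes a sum.
(The middle step uses the fact that the variance of the Poisson variate X is $\mu v$.)
The cost of the sample c may be added to $r_1$ to give the risk function.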
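
The closed form $k\mu/v$ for the expected loss in Example 8 can be checked by summing the series directly. The sketch below is a numerical illustration only; the function names, the truncation point of the sum and the trial values of $\mu$, v and k are assumptions made for the example.

```python
import math


def poisson_pmf(x, mean):
    """Poisson probability of x, computed through logarithms to avoid overflow."""
    return math.exp(x * math.log(mean) - mean - math.lgamma(x + 1))


def expected_loss(mu, v, k=1.0):
    """Sum k (x/v - mu)^2 P(X = x | mu) over x, with X Poisson of mean mu * v.

    The infinite sum is truncated far out in the upper tail (an assumption
    that is harmless for moderate values of mu * v).
    """
    mean = mu * v
    x_max = int(mean + 20 * math.sqrt(mean) + 50)
    return sum(k * (x / v - mu) ** 2 * poisson_pmf(x, mean)
               for x in range(x_max + 1))


# Trial values (hypothetical): mu = 2.5 organisms per unit volume, v = 4, k = 3.
print(expected_loss(2.5, 4.0, k=3.0), 3.0 * 2.5 / 4.0)
```

The two printed numbers should agree to many decimal places, and they show that doubling the sampled volume v halves the expected loss.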


6.14  Bayes' Principle  This principle assumes that there is a prior probability
distribution for the unknown parameter $\theta$ (which we may think of as a
state of Nature). We will denote this probability density by $p_\theta$. One hypothesis
we might make regarding Nature is that $\theta$ belongs to the set $\omega$ (a subset of the
set $\Omega$ of all possible values of $\theta$). We investigate this hypothesis by taking a set of
observations, which have values $x_1, x_2, \ldots, x_N$ (collectively denoted by x). The
probability of this set, given that $\theta$ belongs to $\omega$, is $\int_{(\omega)} P(x \mid \theta)\, p_\theta\, d\theta$, the
integration (or sum) being over all values of $\theta$ such that $\theta$ belongs to $\omega$. The
probability of the same set, whatever the value of $\theta$, is $\int P(x \mid \theta)\, p_\theta\, d\theta$. Therefore, the
probability that $\theta$ belongs to $\omega$, given the observed set of values x, is

(6.14.1)  $P(\theta \in \omega \mid x) = \dfrac{\int_{(\omega)} P(x \mid \theta)\, p_\theta\, d\theta}{\int P(x \mid \theta)\, p_\theta\, d\theta}$

This rule was first clearly stated by Bayes [6], and used by him for reasoning back
from the observed sample to the population sampled. Bayes recognized, however,
that the use of this rule of inference depends upon knowing the prior
probabilities $p_\theta$, and except in artificial illustrations we seldom know much about
these quantities. Bayes suggested, although apparently with some misgivings,
that if we know nothing whatever about $p_\theta$ we should assume as a basis for
action that all possible values of $\theta$ are equally likely. This suggestion was adopted,
rather uncritically, by Laplace, but it was so vigorously attacked in recent times
(mainly by Fisher) that the rule fell into disrepute. It is now beginning to be
generally recognized that Bayes' approach may be very helpful in certain
situations.

If the loss function for action $a_i$ and state of Nature $\theta$ is $L(a_i, \theta)$, and if the
prior probability density of $\theta$ is $p_\theta$, the expected loss associated with $a_i$ is

(6.14.2)  $L(a_i) = \int L(a_i, \theta)\, p_\theta\, d\theta$

The Bayes principle is to choose that particular $a_i$ for which $L(a_i)$ is a minimum,
or, which comes to the same thing, that for which the utility is a maximum. If
some information is available on $p_\theta$, from other experiments or from intuition,
this can be used, but if no information is available, $p_\theta$ is to be taken as uniform
over all $\theta$.
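
The posterior probability of Eq. (6.14.1) can be approximated numerically when the integrals are awkward. The following sketch assumes, purely for illustration, that the data are x successes in N binomial trials and that the hypothesis is that $\theta$ lies in an interval; the function name, the midpoint-rule integration and the default uniform prior are choices made for the example.

```python
from math import comb


def posterior_prob(x, N, lower, upper, prior=lambda theta: 1.0, steps=20000):
    """Approximate P(lower < theta < upper | x successes in N trials).

    Numerator and denominator of Eq. (6.14.1) are both approximated by the
    midpoint rule on [0, 1]; `prior` is the assumed prior density p_theta.
    """
    coeff = comb(N, x)
    h = 1.0 / steps
    numerator = denominator = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        weight = coeff * theta**x * (1 - theta)**(N - x) * prior(theta)
        denominator += weight
        if lower < theta < upper:
            numerator += weight
    return numerator / denominator


# Three successes in three trials, uniform prior: chance that theta exceeds 0.5.
print(posterior_prob(3, 3, 0.5, 1.0))
```

For this case the posterior density is proportional to $\theta^3$, so the exact answer is $1 - (1/2)^4 = 0.9375$, which the numerical value should reproduce closely. A different prior can be tried by passing another function as `prior`.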

Example  9  A  dealer  buys  fuses  in  lots  of  10,000  and  sells  them  at  10  cents 
each,  with  a  double-money-back  guarantee  if  they  prove  defective.  To  protect 
himself  he  takes  a  sample  of  N  for  destructive  testing,  and  refuses  to  buy  the  lot 
if  n  or  more  of  the  samples  are  defective.  What  value  should  he  choose  for  n  ? 

The probability of x defectives, if the proportion of defectives in the whole
lot is $\theta$, is

$b(x, N, \theta) = \binom{N}{x}\, \theta^x (1 - \theta)^{N-x}$

The probability of accepting a lot with proportion $\theta$ is $\sum_{x=0}^{n-1} b(x, N, \theta)
= 1 - B(n, N, \theta)$, and the expected net income in dollars received from such a
lot is

$u(n, \theta) = (1 - 2\theta)\left(1000 - \dfrac{N}{10}\right)\bigl(1 - B(n, N, \theta)\bigr)$

since N of the 10,000 have been destroyed in sampling, and the dealer has to
pay out 20 cents for each defective one he sells. Suppose he estimates the prior
probability of $\theta$ as $p_\theta$. His expected income from the decision rule he has
adopted is

$u(n) = \int_0^1 u(n, \theta)\, p_\theta\, d\theta$

and he should choose n so as to make this as great as possible. If he feels that
any value of $\theta$ is as likely as any other, he will put $p_\theta = 1$, and maximize the
quantity

(6.14.3)  $\dfrac{u(n)}{1000 - N/10} = \sum_{x=0}^{n-1} \binom{N}{x} \int_0^1 (1 - 2\theta)\, \theta^x (1 - \theta)^{N-x}\, d\theta$

The integral can be evaluated in terms of beta functions, and the sum reduces to
$\dfrac{n(N - n + 1)}{(N + 1)(N + 2)}$, which is a maximum when $n = (N + 1)/2$. The lot should be
rejected if there are k or more defectives in a sample of size $2k - 1$.

It may be considered unrealistic to suppose that $p_\theta$ is constant, and the dealer
could, for example, suppose that $\theta$ is equally likely to be anywhere between
0.01 and 0.05, but is quite unlikely to be outside these bounds. That is, he could
assume a rectangular distribution for $\theta$. The integral in Eq. (6.14.3) will then be
expressible in incomplete beta functions, and by the use of tables the maximizing
value of n can be found.
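
For the uniform prior, the right-hand side of Eq. (6.14.3) can be evaluated exactly with rational arithmetic, using the standard beta integral $\int_0^1 \theta^a (1-\theta)^b\, d\theta = a!\, b!/(a + b + 1)!$ for non-negative integers a and b. The sketch below is an illustrative check under that assumption; the function names and the trial sample size N = 19 are hypothetical. Each term of the sum works out to $(N - 2x)/[(N+1)(N+2)]$, so the total should equal $n(N - n + 1)/[(N+1)(N+2)]$.

```python
from fractions import Fraction
from math import comb, factorial


def beta_integral(a, b):
    """Exact integral of theta^a (1 - theta)^b over [0, 1] for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def scaled_income(n, N):
    """Right-hand side of Eq. (6.14.3): u(n) / (1000 - N/10) for a uniform prior."""
    total = Fraction(0)
    for x in range(n):  # the lot is accepted when x = 0, 1, ..., n - 1
        total += comb(N, x) * (beta_integral(x, N - x)
                               - 2 * beta_integral(x + 1, N - x))
    return total


def best_n(N):
    """Rejection number n (from 0 to N + 1) that maximizes the scaled income."""
    return max(range(N + 2), key=lambda n: scaled_income(n, N))


print(best_n(19))  # sample of 2k - 1 = 19 fuses
```

For N = 19 the search should return n = 10, in agreement with the rule of rejecting on k or more defectives in a sample of $2k - 1$.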

6.15  Wald's  Principle  (Minimax  Principle)  To  avoid  having  to  estimate  the 
prior  probabilities,  Wald  suggested  the  principle  of  choosing  the  action  which 
would  minimize  the  maximum  risk  that  could  be  feared  whatever  the  state  of 
Nature might be. In the example above, if $\theta$ could be as high as 1, or even a
little more than 0.5, this principle would tell the dealer to refuse to accept the lot
without troubling to sample it at all. If, however, he feels that the worst possible
lot would have a $\theta$ equal to 0.05, say, he will choose n so as to maximize $u(n, 0.05)$.
This  will  mean  accepting  the  lot  without  sampling,  since  then  he  gets  as  much 
income  as  possible  and  avoids  the  loss  of  the  fuses  destroyed  in  testing. 

Example 10  This is a simplified betting problem, suggested by Sprowls [7].
The bettor has two possible actions in each case, to bet or not to bet. A bet is
always to win and always at the same odds; if he wins he gains $\alpha$, if he does not
win he loses $\beta$. He has a system which gives a probability $\theta$ of picking a winner,
and he decides whether or not to bet on any particular race by the number of
wins, x, recorded in the N previous races on which he has bet. If $x \ge n$, he will
bet; if $x < n$, he will not. The problem is to decide on n.

Assuming that the races can be treated as statistically independent events, the
probability of exactly x wins is $b(x, N, \theta)$ so that the probability of betting is
$B(n, N, \theta)$. If the bettor decides to bet, his expected loss per race is $\beta(1 - \theta) - \alpha\theta$,
which is positive for $\theta < \theta_0$, where $\theta_0 = \beta/(\alpha + \beta)$.

If he decides not to bet at all, his gain will be zero, but he will lose what he
might have won by betting if $\theta > \theta_0$. The risk function is

(6.15.1)  $r(n, \theta) = [\beta(1 - \theta) - \alpha\theta] \cdot B(n, N, \theta)$,  $\theta < \theta_0$

          $r(n, \theta) = [\alpha\theta - \beta(1 - \theta)] \cdot [1 - B(n, N, \theta)]$,  $\theta > \theta_0$

Using the normal approximation to the cumulative binomial, we have
$B \approx 1 - \Phi(z)$, where

(6.15.2)  $z = \dfrac{n - 1/2 - N\theta}{[N\theta(1 - \theta)]^{1/2}}$

The Wald principle is to pick n so as to minimize the maximum value of r over
all possible $\theta$. This minimum occurs when the maximum of r for $\theta < \theta_0$ is equal
to the maximum of r for $\theta > \theta_0$. The actual calculation of these maxima can be
done numerically with the help of good tables of the normal law (e.g., reference
[8] of Chapter 3), and it turns out that the maxima occur at $\theta \approx \theta_0 \pm 0.752\,
[\theta_0(1 - \theta_0)/N]^{1/2}$. The approximate solution of the problem is to take n as
$N\theta_0 = N\beta(\alpha + \beta)^{-1}$, and if $x \ge n$ to bet on the next race.
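
The minimax choice of n in Example 10 can also be found by brute force, using the exact binomial tail in place of the normal approximation. The sketch below is an illustration under stated assumptions: $B(n, N, \theta)$ is taken as the probability of n or more wins, the maximum over $\theta$ is approximated on an evenly spaced grid, and the function names, grid size and trial values ($\alpha = \beta = 1$, N = 21) are hypothetical.

```python
from math import comb


def prob_bet(n, N, theta):
    """B(n, N, theta): probability of n or more wins in N independent races."""
    return sum(comb(N, x) * theta**x * (1 - theta)**(N - x)
               for x in range(n, N + 1))


def risk(n, N, theta, alpha, beta):
    """Risk function of Eq. (6.15.1)."""
    loss_per_bet = beta * (1 - theta) - alpha * theta
    if loss_per_bet > 0:
        # theta < theta_0: betting loses on average, weighted by P(bet).
        return loss_per_bet * prob_bet(n, N, theta)
    # theta >= theta_0: the loss is the gain missed by not betting.
    return -loss_per_bet * (1 - prob_bet(n, N, theta))


def minimax_n(N, alpha, beta, grid=401):
    """n in 0..N+1 minimizing the largest risk over a grid of theta values."""
    thetas = [i / (grid - 1) for i in range(grid)]

    def worst_case(n):
        return max(risk(n, N, t, alpha, beta) for t in thetas)

    return min(range(N + 2), key=worst_case)


print(minimax_n(21, 1.0, 1.0))  # even odds, so theta_0 = 0.5 and N * theta_0 = 10.5
```

With even odds and N = 21 the search should give n = 11, the integer just above $N\theta_0 = 10.5$, which agrees with the approximate solution once the continuity correction in Eq. (6.15.2) is allowed for.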

6.16  Game Theory and Statistics  A subject which has come to the fore in
recent  years  is  the  theory  of  games,  which,  besides  its  application  to  ordinary 
parlor  games,  has  a  very  distinct  relevance  to  economics  and  to  military  strategy. 
A  good  many  statistical  problems  can  be  thought  of  as  games  played  against 
Nature,  although  Nature  of  course  is  not  a  malevolent  opponent,  out  to  make 
things  as  bad  as  possible  for  the  statistician.  It  is  hardly  surprising  that  in  these 
circumstances  Wald's  principle  is  often  unduly  pessimistic,  and  several  other 
decision criteria have been suggested. References [8], [9] and [10] may be
consulted for more details.


PROBLEMS 
A.  (§§6.1-6.5) 

1.  Prove that the maximum likelihood estimator for the parameter $\mu$ of a Poisson
distribution is the sample mean m, and that the variance of m is $\mu/N$, where N is the
sample size.

2.  Show that, for the Poisson distribution, m is a sufficient estimator for $\mu$. Also
show that condition (b) of Eq. (6.4.3) is satisfied.

3.  For a normal population with mean $\mu$ and variance $\sigma^2$, the mean and the median
of a sample of N are both consistent estimators of $\mu$. For large N the variance of the
median is approximately $\pi\sigma^2/2N$. Show that as an estimator of $\mu$ the median is roughly
64% efficient.

4.  Suppose that the mean $\mu$ of a normal population is known, but that the variance
$\sigma^2$ is to be estimated. Show that the sample variance $s^2$, although unbiased, has an
efficiency $(N - 1)/N$ and is therefore only asymptotically most efficient, while the
sample second moment about $\mu$ is both unbiased and most efficient, for any N.

5.  Show that for a binomial population with probability of success $\theta$ in each trial,
the maximum likelihood estimator of $\theta$ is the proportion of successes p in a sample.
Show also that the variance of p, as given by Eq. (6.3.3), agrees with that previously
found, namely, $\theta(1 - \theta)/N$.

6.  A one-parameter gamma variate has the density function $f(x) = x^{\alpha - 1} e^{-x}/\Gamma(\alpha)$,
$x > 0$. Write down the equations for determining from a sample of size N the maximum
likelihood estimator for $\alpha$ and its variance. (In order to solve these equations, tables
of the digamma function $d \log \Gamma(\alpha)/d\alpha$ and the trigamma function $d^2 \log \Gamma(\alpha)/d\alpha^2$
must be used. See H. T. Davis, Tables of the Higher Mathematical Functions,
Bloomington, Indiana, 1933-5.)

7.  Prove that an unbiased estimator of $\alpha$ in Problem 6 is the arithmetic mean m
of the sample. Show also that the efficiency of this estimator is $\{\alpha\, d^2[\log \Gamma(\alpha)]/d\alpha^2\}^{-1}$.
(This quantity tends to zero as $\alpha$ decreases to 0. The nearer $\alpha$ is to zero the more skew
is the distribution.)

8.  The mean absolute deviation for a sample of size N and mean m is defined as
$d = \sum_i |x_i - m|/N$. For samples from a normal population of mean $\mu$ and variance
$\sigma^2$, the variance of d is given by

$V(d) = \dfrac{2\sigma^2(N - 1)}{\pi N^2}\left\{\tfrac{1}{2}\pi + [N(N - 2)]^{1/2} - N + \sin^{-1}[1/(N - 1)]\right\}$

Compare the asymptotic efficiency of the quantity $d\sqrt{\pi/2}$ with that of the sample
standard deviation as estimators of $\sigma$. Hint: Prove that $V(d) = \sigma^2(1 - 2/\pi)/N +
O(1/N^2)$, and note that $V(s) = \sigma^2/2N + O(1/N^2)$.

9.  Show that for the continuous distribution with density $f(x) = \theta e^{-\theta x}$, $0 < x < \infty$,
confidence limits for $\theta$ for a large sample, with confidence coefficient 0.95, are given by
$(1 \pm 1.96/\sqrt{N})/m$ where m is the sample mean. Hint: For large N, the m.l. estimator is
approximately normal.

10.  Show that for the rectangular population with density $f(x) = (\beta - \alpha)^{-1}$,
$0 < \alpha < x < \beta$, joint maximum likelihood estimators for $\alpha$ and $\beta$ are the smallest and
largest members of the sample respectively. Hint: Show that these give the greatest
possible value for L.

11.  Suppose that the discrete variate X is binomially distributed except that it cannot
take the value 0. The probability that $X = x$ ($x = 1, 2, \ldots, n$) is given by
$f(x) = \binom{n}{x}\theta^x (1 - \theta)^{n-x}[1 - (1 - \theta)^n]^{-1}$. If the numbers of successes in N repetitions of the
sequence of n trials are $x_1, x_2, \ldots, x_N$, show that the maximum likelihood estimator $\hat\theta$ of
$\theta$ is given by the solution of the equation $n\hat\theta = m[1 - (1 - \hat\theta)^n]$, where m is the
arithmetic mean of the $x_i$. Find $\hat\theta$ for the case $n = 2$.

12.  Obtain an equation for the maximum likelihood estimator of $\rho$ derived from a
sample of size N from the bivariate standard normal population with joint density
function

$f(x, y) = (2\pi)^{-1}(1 - \rho^2)^{-1/2} \exp\left[-\dfrac{1}{2}\, \dfrac{x^2 - 2\rho xy + y^2}{1 - \rho^2}\right]$

Show that the variance of this estimator is $(1 - \rho^2)^2/[N(1 + \rho^2)]$. Hint: $E(x^2) =
E(y^2) = 1$, $E(xy) = \rho$.

B.  (§§  6.6-6.12) 

1.  The  yield  in  bushels  of  a  certain  type  and  size  of  potato  plot  is  found  to  be 
normally  distributed  with  a  standard  deviation  of  2.36.  It  is  hoped  that  the  application 
of  a  certain  fertilizer  will  increase  the  yield  by  at  least  0.5  bushel.  How  large  a  sample 
of  plots  should  be  used  to  detect  a  difference  of  this  amount,  using  the  mean  sample 
yield  as  a  criterion,  with  a  test  of  size  5  %  and  power  90  %  ? 

2.  An experimenter knows that a distribution is approximately normal with
standard deviation 1.2. He wishes to test the hypothesis that the population mean is
75 against the alternative hypothesis that $\mu > 75$, using a sample of size N and a test
of size 1%. What test should he use? Calculate the power for $N = 9$ and for $\mu = 75.5$,
76, 76.5. What size sample should he take if he wants to be 95% sure of detecting a
difference as small as one unit from the assumed value 75, still using a test of size 1%?

3.  A population has the Poisson distribution with parameter $\mu$, which may have
the values 1 or 2 but no others. Find the likelihood-ratio test for testing $H_0$ (that $\mu = 1$)
against $H_1$ (that $\mu = 2$), using the mean of 10 observations of X as the criterion. Assume
that the probability of error of the first kind is not greater than 0.05, and calculate the
power of the test. Hint: The distribution of the sum of N independent Poisson variates
with parameter $\mu$ is also Poisson with parameter $N\mu$. Use a table of the cumulative
Poisson function for numerical results.

4.  Develop the likelihood-ratio test for a binomial population, for testing the simple
hypothesis $\theta = \theta_0$ against the simple alternative $\theta = \theta_1$ (where $\theta_1 > \theta_0$), using as a
criterion the number x of successes in the first n trials. If the size of the test is
approximately $\alpha$ and the power approximately $1 - \beta$, find a relation to determine n. Give
a numerical result for $\theta_0 = 0.5$, $\theta_1 = 0.7$, $\alpha = 0.05$, $\beta = 0.10$. Hint: Use the normal
approximation to the binomial.

5.  Find the likelihood-ratio test for testing the significance of the difference between
the mean of a sample and an assumed population mean $\mu_0$, the population being
normal with unknown standard deviation $\sigma$. Hint: Find the ratio of the maximum
likelihood over all $\sigma$ for a given $\mu_0$ to that over all $\sigma$ and all $\mu$. The test is equivalent to
Student's t-test, which will be discussed more fully in Chapter 8.

6.  Assume that the random variables $Y_i$, $i = 1, 2, \ldots, N$, are independent and
normally distributed with means $\nu_i = \alpha + \beta x_i$, and a common variance $\sigma^2$, for known
values of $x_i$. Write down the likelihood function for the set of $Y_i$ and hence obtain
joint maximum likelihood estimators for the parameters $\alpha$, $\beta$ and $\sigma$. (These estimators
will be discussed more fully in Chapter 11.)

7.  It is desired to test the null hypothesis that a certain coin is fair (that is, that the
probability $\theta$ of a head when the coin is tossed is 0.5) by counting the number of heads
x in n tosses. Show that the likelihood-ratio test of $H_0$ against the alternative
hypothesis $H_1$ (that $\theta$ is either less than or greater than 0.5) is equivalent to rejecting $H_0$
when $|x - n/2| > k$, where k is determined by the size of the test.

If the test is to be of size 0.05 and power 0.9 to detect a difference of 0.02 in $\theta$ from
the assumed value 0.5, how large should n be? Hint: The likelihood-ratio test may be
written: reject $H_0$ when $f(x) > c$, where $f(x) = x \log(x/n) + (n - x)\log(1 - x/n)$.
Show that $f(x)$ has a minimum at $x = n/2$ and is symmetrical about this value. For
the second part of the question use the normal approximation to the binomial.

C.  (§§6.13-6.15) 

1.  Carry  out  the  integration  indicated  in  Eq.  (6.14.3)  and  show  that  it  reduces  to 
the  stated  value.  Hint:  Use  Eq.  (4.5.3)  and  express  the  gamma  functions  as  factorials. 

2.  A  bag  contains  10  balls,  either  black  or  white,  but  it  is  not  known  how  many  of 
each.  A  ball  is  drawn  at  random,  looked  at  and  replaced,  and  three  times  running  the 
ball  so  selected  is  white.  What  is  the  probability  that  the  bag  contains  at  least  five 
white  balls?  Hint:  Use  Bayes'  rule,  with  sums  instead  of  integrals.  Obtain  a  numerical 
result by assuming a constant value for the prior probability of $\theta$ white balls ($\theta = 0, 1,
2, \ldots, 10$).

3.  Instead  of  the  assumption  at  the  end  of  Problem  2,  suppose  that  the  bag  was 
filled  by  picking  10  balls  at  random  from  a  very  large  number  of  black  and  white  balls 
mixed  in  equal  proportions.  What  is  now  the  probability  of  at  least  five  white  balls, 
after  seeing  the  three  white  balls  drawn?  Hint:  The  prior  probabilities  are  binomial. 

4.  A set of 100 independent observations is made on a variate X which is normally
distributed with unknown mean $\mu$ and known variance 25 units. The null hypothesis
$H_0$ is that $\mu = 0$, and the alternative hypothesis $H_1$ is that $\mu = 2$ (these are the only
possibilities). The decision whether to accept $H_0$ or $H_1$ is made on the basis of the mean
(m) of X for the 100 observations. If $H_0$ is true, the losses corresponding to these two
decisions ($d_0$ and $d_1$) are 0 and 25, respectively; if $H_1$ is true the losses are 10 and 0,
respectively.

Given that $\xi$ is the prior probability of $H_0$, show that, on the Bayes principle of
minimizing the expected loss, the decision $d_0$ should be taken if $m < c$ where $c = 1 +
(1/8)\log_e\{5\xi/[2(1 - \xi)]\}$. Hint: The mean m is normally distributed with variance
1/4. Use Bayes' rule to find the probability of $H_0$ after the sample has been examined.

5.  In Problem 4 above, find the probability $\alpha(\xi)$ of rejecting $H_0$ if true and the
probability $\beta(\xi)$ of accepting $H_0$ if false.

The expected loss is $25\xi\alpha(\xi) + 10(1 - \xi)\beta(\xi)$. Compute this quantity for various
values of $\xi$ between 0 and 1 and find approximately for what value the expected loss is
a maximum.

6.  Solve Problem 4 above using the minimax principle. With $\xi$ unknown, the
maximum risk is $25\alpha(\xi)$ if $H_0$ is true and $10\beta(\xi)$ if $H_1$ is true. The minimum of this
maximum risk occurs when $25\alpha(\xi) = 10\beta(\xi)$. Show that c is then 1.096, and that the
corresponding value of $\xi$ is about 0.46. (This is the least favorable value of $\xi$.)

7.  A buyer of manufactured articles in large lots is willing to accept a lot if the
proportion $\theta$ of defective articles is less than $\theta_0$, but will wish to reject it if $\theta > \theta_0$. If
he accepts a lot, his loss $L_1(\theta)$ will be zero if $\theta < \theta_0$ but positive if $\theta > \theta_0$. If he rejects
a lot his loss $L_2(\theta)$ will be zero if $\theta > \theta_0$ but positive if $\theta < \theta_0$. He bases his decision on
the number of defectives r in a sample of N (assumed binomial). What will be his
decision rule on the Bayes principle if the prior probability of $\theta$ is $\xi(\theta)$? Show that this is
equivalent to the rule: accept the lot if $r \le c$, where c is some fixed number. Hint: He
will accept with r defectives if the expected loss in accepting is less than the expected
loss in rejecting. Show that if this is true for $r = c$ it is also true for $r = c - 1$ and so
for all $r < c$.

Chapter 7

SOME  SAMPLING  PROCEDURES 

7.1  Random and Less Random Samples  Sampling is undertaken in order to
find  out  something  about  a  population  without  having  to  examine  every  item 
in  it.  By  "a  population,"  we  mean  any  collection  (usually  large)  of  elements 
such  as  people,  pigs,  farms,  coin-tosses,  incomes,  or  whatever  it  may  be,  about 
which  we  want  some  information.  Often  an  important  decision  must  be  made  on 
the  basis  of  knowledge  obtained  from  the  sample,  so  that  it  is  useful  to  be  able  to 
estimate  how  reliable  this  knowledge  actually  is.  Sampling  theory  is  concerned 
with  ways  of  estimating,  and  perhaps  improving,  the  precision  of  the  information 
obtainable  from  a  sample  about  a  population. 

Any  procedure  for  making  such  an  estimate  must  be  based  on  the  theory  of 
probability.  That  is,  it  must  suppose  that  the  sample  is  random.  Most  of  the 
theory  of  estimation,  hypothesis-testing  and  decision-making  that  we  have  been 
considering  in  the  last  two  chapters  is  based  on  the  concept  of  a  random  sample. 
A  sample  of  given  size  is  said  to  be  random  if  every  possible  sample  of  this  size 
in  the  population  (supposedly  finite)  has  a  calculable  probability  of  being 
chosen,  but  this  probability  need  not  be  the  same  for  all  items.  If,  however,  the 
sample  (of  size  N)  is  selected  in  such  a  way  that  every  combination  of  N  elements 
in  the  population  has  an  equal  probability  of  being  chosen,  the  process  is  called 
simple  random  sampling.  This  is  the  usual  assumption  in  theoretical  statistics, 
although  in  actual  sample  surveys  simple  random  sampling  is  rarely  used.  For 
reasons  of  cost  and  administrative  convenience,  as  well  as  in  order  to  improve 
precision,  some  modification  of  the  simple  random  design  is  generally  adopted. 

For a sample of N from a finite population of size M the probability that
any individual item will be drawn is N/M. If the population is infinite,
this probability is zero, but it still makes sense in many cases to assume that
one item is as likely to be drawn as another (see § 5.1). When a number of
tosses, made with a particular coin, is considered as a sample of the practically
infinite number of tosses that might conceivably be made with this same coin, the
sample is obviously not random in the strict sense, since it consists of the first N
items of the population in order of time. However, we make the physical
assumption that the order in time is quite irrelevant as far as the characteristic
of any toss (heads or tails) is concerned, so that the first N tosses form effectively
a random sample.

If  some  prior  information  is  available,  it  may  be  possible  to  use  stratified 
sampling and so gain in precision over simple random sampling. In this procedure,
the population is divided into groups, the elements within a group being
more alike than those in the population as a whole. If a simple random sample
is  drawn  from  each  group,  we  still  have  a  probability  sampling  procedure,  but 
we  have  insured  that  each  group  is  represented  in  the  total  sample.  The  groups 
are  called  strata,  and  the  process  of  dividing  the  population  into  groups  is  called 
stratification.  This  procedure  normally  reduces  the  sampling  variance  of  the 
variate  measured.  It  is  particularly  effective  when  there  are  extreme  values  in 
the  population — stratification  with  regard  to  income  levels,  for  example,  is  a 
common  practice.  The  costs  of  sampling  may  differ  considerably  from  one 
stratum  to  another  (as  between  urban  and  rural  households,  for  instance)  and 
these  costs  may  be  important  in  setting  up  the  strata. 
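
Drawing a simple random sample within each stratum can be sketched in a few lines of Python. The example is illustrative only: the function name, the stratum labels, the population of numbered households and the sample sizes are all hypothetical.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample, without replacement, from every stratum.

    strata maps a stratum name to the list of its units;
    sizes maps the same names to the number of units to draw.
    """
    rng = random.Random(seed)
    return {name: rng.sample(units, sizes[name])
            for name, units in strata.items()}


# Hypothetical population: households 1-700 are urban, 701-1000 are rural.
strata = {"urban": list(range(1, 701)), "rural": list(range(701, 1001))}
sample = stratified_sample(strata, {"urban": 14, "rural": 6}, seed=1)
print(sample["urban"], sample["rural"])
```

Here the sample sizes are proportional to the stratum sizes, so every household has the same chance (1 in 50) of being drawn, yet both strata are certain to be represented.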

The people who conduct sample surveys are usually much concerned with
questions of cost. They want the maximum precision per dollar spent, and
therefore tend to favor cluster sampling. This is a method of reducing costs by
first taking a random sample of groups or clusters and then taking sub-samples
from the clusters selected. To take a sample of 3000 households from the population
of the United States, we might first draw a sample of, say, 50 counties
and  then  sample  these  proportionately  to  their  total  populations.  A  simple 
random  sample  would  probably  be  spread  over  many  more  than  50  counties,  and 
would  need  much  more  travel  and  supervision. 

Cluster  sampling  may  not  be  very  efficient  as  far  as  precision  of  the  estimate  is 
concerned.  The  best  results  occur  when  the  clusters  each  contain  very  diversified 
elements — just  the  opposite  from  the  requirements  for  stratified  sampling. 

Another  common  procedure  in  some  types  of  sampling  surveys  is  systematic 
sampling.  To  draw  500  cards  from  a  file  containing  10,000  cards,  we  could 
select  a  random  number  between  1  and  20  (say  13)  and  then  take  every  20th 
card,  beginning  with  the  13th.  That  is,  we  could  pick  the  cards  numbered  13, 
33, 53, and so on. If the order of the cards has nothing whatever to do with the
variate  for  which  we  are  sampling,  this  gives  us  effectively  a  random  sample  and 
it  is  easy  to  apply.  To  sample  housing  units  in  a  city  one  might,  for  instance,  take 
every  12th  block  and  every  fifth  house  in  the  block,  but  it  would  be  well  to  make 
sure  that  the  procedure  adopted  did  not  lead  to  picking  out  an  undue  proportion 
of  corner  houses — at  least  if  the  object  of  the  survey  has  any  connection  with 
economic  status.  Corner  houses  often  pay  higher  taxes  and  are  generally 
occupied  by  people  with  higher  incomes  than  non-corner  houses. 
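
The card-file procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `systematic_sample` and the seeded generator are assumptions made for the example, and the sketch supposes that the file size is an exact multiple of the sample size.

```python
import random

def systematic_sample(n_items, n_sample, rng=random):
    """Return the 1-based positions of a systematic sample."""
    step = n_items // n_sample            # sampling interval (20 in the card example)
    start = rng.randint(1, step)          # random start between 1 and step, inclusive
    return [start + step * j for j in range(n_sample)]

# 500 cards from a file of 10,000: every 20th card after a random start.
cards = systematic_sample(10_000, 500, random.Random(13))
```

Whether such a sample behaves like a random one depends entirely on the ordering of the file, as the corner-house example shows.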

A  method  of  selecting  a  sample  often  employed  in  public-opinion  polls  is 
that  of  quota  sampling,  in  which  an  interviewer  is  instructed  to  fill  a  specified 
quota  by  finding  as  best  he  can  persons  satisfying  certain  restrictions — he  may 
be  asked  to  contact  a  specified  number  of  persons  of  a  particular  sex,  age-group, 
and  income-group,  for  example,  but  no  attempt  is  made  to  make  the  sample 
random.  This  method  is  apt  to  introduce  a  completely  unknown  bias  into  the 
estimates  made. 

Any  method  of  sampling  which  is  not  random,  but  tries  to  pick  out  a  typical 
or  representative  sample,  may  be  called  purposive  sampling.  This  may  be  useful 
if only a very small sample can be taken, and if the person picking the sample has
good judgment and expert knowledge, but there is no statistical theory available
for  measuring  the  reliability  of  the  results  obtained.  Sometimes,  of  course,  a 
random  sample  is  from  the  nature  of  things  practically  unobtainable — a  sample 
of  fish  from  the  sea,  for  instance — and  we  are  forced  to  use  any  kind  of  sample 
we  can  get.  Nevertheless  a  probability  sample  should  be  obtained  whenever 
possible,  and  only  then  is  the  theory  of  sampling  strictly  applicable.  For  a  fuller 
discussion  of  sampling  procedures  see  [1]  and  [2]. 

* 7.2 Stratified Sampling Suppose the population of size $M$ is divided into
strata of sizes $M_1, M_2 \ldots M_k$, and a simple random sample of size $N_i$ is taken
from the $i$th stratum. If $X_{i\alpha}$ is the measured characteristic for the $\alpha$th item in the
$i$th stratum, the $i$th stratum mean for the population is

(7.2.1)  $\mu_i = \frac{1}{M_i} \sum_{\alpha} X_{i\alpha}, \qquad \sum_i M_i = M$

and the over-all mean for the population is

(7.2.2)  $\mu = \frac{1}{M} \sum_{i=1}^{k} M_i \mu_i$

The estimator of $\mu$, based on the stratified sample, is $\bar{X}$, where

(7.2.3)  $\bar{X} = \frac{1}{M} \sum_i M_i \bar{X}_i, \qquad \bar{X}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} X_{ij}$

and $X_{ij}$ is the value of $X$ for the $j$th item in the sample from the $i$th stratum. Note
that $\bar{X}$ is a weighted mean of the sample stratum means, with weights depending
on the sizes of the strata in the population, so that these sizes must be known
fairly accurately before the method can be used.

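From Eq. (5.7.3), $E(\bar{X}_i) = \mu_i$, and therefore, by Eq. (3) above,

(7.2.4)  $E(\bar{X}) = \frac{1}{M} \sum_i M_i E(\bar{X}_i) = \mu$

by Eq. (2). That is, $\bar{X}$ is an unbiased estimator of $\mu$. We will now show that the
variance of $\bar{X}$ is given by

(7.2.5)  $V(\bar{X}) = \frac{1}{M^2} \sum_i \frac{M_i (M_i - N_i)}{N_i} \sigma_i^2$

where $\sigma_i^2 = (M_i - 1)^{-1} \sum_{\alpha} (X_{i\alpha} - \mu_i)^2$, the population variance in the $i$th
stratum.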

Since $\bar{X}$ is a linear combination of the $\bar{X}_i$ with coefficients $M_i/M$, and since
the strata are sampled independently, we can use Bienaymé's Theorem (§ 2.14),
which gives $V(\bar{X}) = M^{-2} \sum_i M_i^2 V(\bar{X}_i)$. Here $V(\bar{X}_i) = \kappa_{2i}(1/N_i - 1/M_i)$
by Eq. (5.11.7), where $\kappa_{2i} = \sigma_i^2$. If $f_i$ is the sampling fraction $N_i/M_i$,

(7.2.6)  $V(\bar{X}) = \frac{1}{M^2} \sum_i \frac{M_i^2}{N_i} (1 - f_i) \sigma_i^2$

which is equivalent to Eq. (5).

Since $k_{2i}$ $[= (N_i - 1)^{-1} \sum_j (X_{ij} - \bar{X}_i)^2]$ is an unbiased estimator of $\kappa_{2i}$,
we may estimate the variance of $\bar{X}$ by means of

(7.2.7)  $\hat{V}(\bar{X}) = \frac{1}{M^2} \sum_i \frac{M_i^2}{N_i} (1 - f_i) k_{2i}$

where $k_{2i}$ is the sample variance of the sample from the $i$th stratum.
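
As a rough illustration of Eqs. (7.2.3) and (7.2.7), the following Python sketch computes the stratified estimate and its estimated variance. The function name and the two small strata are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from statistics import mean, variance

def stratified_estimate(stratum_sizes, samples):
    """Return (estimate of mu, estimated variance), following Eqs. (7.2.3) and (7.2.7).

    stratum_sizes: population sizes M_i; samples: one list of observations per stratum.
    """
    M = sum(stratum_sizes)
    x_bar = sum(M_i * mean(s) for M_i, s in zip(stratum_sizes, samples)) / M
    v_hat = 0.0
    for M_i, s in zip(stratum_sizes, samples):
        N_i = len(s)
        f_i = N_i / M_i                   # sampling fraction in this stratum
        k2_i = variance(s)                # sample variance, divisor N_i - 1
        v_hat += (M_i ** 2 / N_i) * (1 - f_i) * k2_i
    return x_bar, v_hat / M ** 2

# Hypothetical data: two strata of sizes 60 and 40.
sizes = [60, 40]
samples = [[10, 12, 11, 13], [30, 34, 32]]
x_bar, v_hat = stratified_estimate(sizes, samples)
```

Each stratum needs at least two sampled items, since otherwise its sample variance is undefined.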

If $f_i$ is the same for each stratum, and equal to $f$, say, the sampling is said to be
proportionate (the $N_i$ are proportional to the $M_i$). Then

(7.2.8)  $V(\bar{X}) = \frac{1 - f}{MN} \sum_i M_i \sigma_i^2$

since $N = fM$.

If $X$ and $Y$ are two variates, both measured for each item in the sample, the
covariance of $\bar{X}$ and $\bar{Y}$ is similarly given by

(7.2.9)  $C(\bar{X}, \bar{Y}) = \frac{1}{M^2} \sum_i \frac{M_i^2}{N_i} (1 - f_i) \eta_i$

where $\eta_i = (M_i - 1)^{-1} \sum_{\alpha} (X_{i\alpha} - \mu_i)(Y_{i\alpha} - \nu_i)$, the population covariance in
the $i$th stratum, $\nu_i$ being the stratum mean for $Y$. As before, $\eta_i$ may be estimated
from the sample covariance in this stratum.

The gain in precision due to using proportionate sampling, compared with
simple random sampling from the whole population, may be found by comparing
Eq. (7.2.8) with the expression for the variance of the mean of a random
sample, namely,

(7.2.10)  $V_r(\bar{X}) = \sigma^2 (1 - f)/N$

where $\sigma^2 = (M - 1)^{-1} \sum_{\alpha} (X_{\alpha} - \mu)^2$, and $\alpha$ takes any value from 1 to $M$.
Therefore,

(7.2.11)  $V_r(\bar{X}) - V(\bar{X}) = \frac{1 - f}{N} \left( \sigma^2 - \frac{1}{M} \sum_i M_i \sigma_i^2 \right)$

In practice, the $M_i$ are usually so large that the distinction between $M_i$ and
$M_i - 1$ is unimportant, and we can put

$M\sigma^2 \approx \sum_{\alpha=1}^{M} (X_{\alpha} - \mu)^2, \qquad M_i \sigma_i^2 \approx \sum_{\alpha=1}^{M_i} (X_{i\alpha} - \mu_i)^2$



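Since

$(X_{i\alpha} - \mu_i)^2 = (X_{i\alpha} - \mu)^2 + (\mu - \mu_i)^2 + 2(\mu - \mu_i)(X_{i\alpha} - \mu)$

we have

$M_i \sigma_i^2 \approx \sum_{\alpha=1}^{M_i} (X_{i\alpha} - \mu)^2 + M_i (\mu - \mu_i)^2 + 2 M_i (\mu - \mu_i)(\mu_i - \mu) = \sum_{\alpha=1}^{M_i} (X_{i\alpha} - \mu)^2 - M_i (\mu_i - \mu)^2$

so that

(7.2.12)  $V_r(\bar{X}) - V(\bar{X}) \approx \frac{1 - f}{MN} \sum_i M_i (\mu_i - \mu)^2$

and this expression is never negative. The gain from proportionate sampling
is greater, the greater the differences between the stratum means.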

The question arises as to the optimum choice of the $N_i$. It was proved by
Neyman [3] that for a fixed $N$ the variance of $\bar{X}$ is least when $N_i$ is proportional
to $M_i \sigma_i$. That is, we should choose $N_i$ so that

(7.2.13)  $N_i/N = M_i \sigma_i / \sum_i (M_i \sigma_i)$

This, of course, supposes that some information, from a preliminary survey
or from previous experience, is available about the $\sigma_i$.

If the cost $c_i$ per unit of sampling also varies from one stratum to another,
and if the total cost $c$ of the whole survey is fixed, it may be shown that the
optimum sampling number for the $i$th stratum is proportional to $M_i \sigma_i / c_i^{1/2}$.
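
The allocation rules just stated can be sketched as follows. The function name `optimum_allocation` and the example strata are hypothetical; the sketch returns unrounded sample numbers and, when unit costs are supplied, only divides a given total in the stated proportions (finding the total that exhausts a fixed budget is left aside).

```python
from math import sqrt

def optimum_allocation(total_n, sizes, sigmas, unit_costs=None):
    """Split total_n across strata in proportion to M_i * sigma_i / sqrt(c_i)."""
    if unit_costs is None:
        unit_costs = [1.0] * len(sizes)   # equal costs reduce to Eq. (7.2.13)
    weights = [M_i * s_i / sqrt(c_i)
               for M_i, s_i, c_i in zip(sizes, sigmas, unit_costs)]
    total = sum(weights)
    return [total_n * w / total for w in weights]

# Hypothetical strata: a large homogeneous one and a small, highly variable one.
alloc = optimum_allocation(100, sizes=[900, 100], sigmas=[1.0, 9.0])
costly = optimum_allocation(100, sizes=[900, 100], sigmas=[1.0, 9.0], unit_costs=[1.0, 9.0])
```

In the first call the small stratum receives half the sample although it holds only a tenth of the population; making it nine times as costly per unit cuts its share to a quarter.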

*  7.3  Cluster  Sampling  In  simple  cluster  sampling,  the  elements  of  the 
population  are  grouped  in  clusters  which  themselves  are  the  primary  sampling 
units.  In  a  one-stage  plan  all  the  elements  in  the  selected  clusters  (picked  by 
simple  random  sampling)  are  included  in  the  sample.  In  a  two-stage  plan  a 
random  sub-sample  is  selected  from  each  primary  sampling  unit,  and  of  course 
further  stages  of  sampling  can  be  introduced. 

We will suppose that there are $k$ clusters in the population, with sizes
$M_i$ ($i = 1, 2 \ldots k$), and $l$ of these are selected in the first stage. From the $j$th
selected cluster ($j = 1, 2 \ldots l$), the number of second-stage items picked is $N_j$.
This is represented diagrammatically in Figure 37, where samples are indicated
from some, but not all, of the clusters. Let $X_{i\alpha}$ be the value of $X$ for the $\alpha$th
item in the $i$th cluster, and $X_{jh}$ the value for the $h$th item picked from the $j$th
selected cluster ($\alpha = 1, 2 \ldots M_i$; $h = 1, 2 \ldots N_j$).

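With a notation similar to that used before,

$M_i \mu_i = \sum_{\alpha} X_{i\alpha}, \qquad M\mu = \sum_i M_i \mu_i$

$N_j \bar{X}_j = \sum_h X_{jh}, \qquad N\bar{X} = \sum_j N_j \bar{X}_j$

$M = \sum_i M_i, \qquad N = \sum_j N_j$

We take as the estimator of $\mu$ the quantity

(7.3.1)  $\hat{X} = \frac{k}{lM} \sum_j M_j \bar{X}_j$

and it can be shown that

(7.3.2)  $E(\hat{X}) = \mu$

and

(7.3.3)  $V(\hat{X}) = \frac{k}{lM^2} \left[ (k - l) \kappa_2' + \sum_i M_i^2 \sigma_i^2 \left( \frac{1}{N_i} - \frac{1}{M_i} \right) \right]$

where

(7.3.4)  $\kappa_2' = (k - 1)^{-1} \sum_i (M_i \mu_i - M\mu/k)^2$

and $\sigma_i^2$ has the same meaning as in Eq. (7.2.5).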

From Eq. (1), $E(\hat{X}) = \frac{k}{lM} \sum_j E(M_j \bar{X}_j)$. Now the actual value of $M_j \bar{X}_j$
depends upon two random events, the selection of the $j$th cluster and the
selection of the $N_j$ items from this cluster. It is shown in Appendix A.14 that if
$X$ is a random variable depending on $Y$ which is itself a random event, then
$E(X) = E[E(X|Y)]$. Here the event $Y$ is the choice of the $j$th cluster. Given
this choice, the expectation of $M_j \bar{X}_j$ is $M_j \mu_j$, where $\mu_j$ is the mean for the
whole cluster. Therefore

(7.3.5)  $E(M_j \bar{X}_j) = E(M_j \mu_j) = \frac{1}{k} \sum_i M_i \mu_i$

since this cluster is one of $k$ clusters, all with an equal chance of being picked. It
follows that

(7.3.6)  $E(\hat{X}) = \frac{k}{lM} \sum_j \frac{1}{k} \sum_i M_i \mu_i = \frac{1}{M} \sum_i M_i \mu_i = \mu$

Fig. 37 Cluster sampling


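To find the variance of $\hat{X}$ we need Theorem 2 of Appendix A.14, according
to which $V(X) = E[V(X|Y)] + V[E(X|Y)]$, where $X$ is replaced by $\hat{X}$ and $Y$
means the choice of a particular set of $l$ clusters from the set of all $k$ clusters in
the population. When this set is fixed, the problem becomes one of stratified
sampling with $l$ strata. The variance of the $j$th mean $\bar{X}_j$ is, by Eq. (5.11.7),

$V(\bar{X}_j) = \sigma_j^2 \left( \frac{1}{N_j} - \frac{1}{M_j} \right)$

where $\sigma_j^2 = (M_j - 1)^{-1} \sum_{\alpha} (X_{j\alpha} - \mu_j)^2$, so that

$V(\hat{X}|Y) = \frac{k^2}{l^2 M^2} \sum_j M_j^2 \sigma_j^2 \left( \frac{1}{N_j} - \frac{1}{M_j} \right)$

The expectation of this for any choice of the $l$ clusters is

(7.3.7)  $E[V(\hat{X}|Y)] = \frac{k}{lM^2} \sum_i M_i^2 \sigma_i^2 \left( \frac{1}{N_i} - \frac{1}{M_i} \right)$

which is the second term in Eq. (3). The first term represents the part of the
variance due to first-stage sampling. Since $E(M_j \bar{X}_j | Y) = M_j \mu_j$,

(7.3.8)  $E(\hat{X}|Y) = \frac{k}{lM} \sum_j M_j \mu_j = \frac{k}{M} \overline{M_j \mu_j}$

the bar denoting the mean over the $l$ selected clusters. By Eq. (5.11.7),

$V(\overline{M_j \mu_j}) = \kappa_2' (1/l - 1/k)$

where

$\kappa_2' = \left[ \sum_i (M_i \mu_i - M\mu/k)^2 \right] / (k - 1)$

so that

$V[E(\hat{X}|Y)] = \frac{k^2}{M^2} \kappa_2' \left( \frac{1}{l} - \frac{1}{k} \right) = \frac{k (k - l)}{lM^2} \kappa_2'$

This gives the first term in Eq. (3). This term is small when the clusters are very
much alike in size and composition.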
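
A two-stage plan with the estimator of Eq. (7.3.1) might be sketched in Python as below. The function name, the `n_sub` argument (a rule giving the sub-sample size for a cluster of given size) and the three small clusters are all hypothetical, introduced only for the illustration.

```python
import random
from statistics import mean

def two_stage_estimate(clusters, l, n_sub, rng):
    """Estimate the population mean by Eq. (7.3.1).

    clusters: list of lists (all k clusters); l: number of clusters drawn at
    stage one; n_sub: function giving the sub-sample size N_j for a cluster of size M_j.
    """
    k = len(clusters)
    M = sum(len(c) for c in clusters)
    chosen = rng.sample(clusters, l)              # stage one: simple random sample of clusters
    total = 0.0
    for c in chosen:
        sub = rng.sample(c, n_sub(len(c)))        # stage two: sub-sample without replacement
        total += len(c) * mean(sub)               # M_j times the sub-sample mean
    return k * total / (l * M)

# Hypothetical population of three clusters.
clusters = [[1, 2, 3, 4], [10, 12], [20, 21, 22]]
rng = random.Random(0)
census = two_stage_estimate(clusters, l=3, n_sub=lambda m: m, rng=rng)
partial = two_stage_estimate(clusters, l=2, n_sub=lambda m: 2, rng=rng)
```

When every cluster is selected and fully enumerated, as in the `census` call, the estimator reproduces the population mean exactly; with fewer clusters or smaller sub-samples it varies from one drawing to another.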

* 7.4 Systematic Sampling Suppose the population consists of the elements
$E_1, E_2 \ldots E_M$, arranged in some fixed order. Any systematic sample consists
of the elements $E_i, E_{k+i}, E_{2k+i} \ldots E_{(N-1)k+i}$, where $i$ is one of the numbers
$1, 2 \ldots k$, and $Nk \le M$. Usually, for a sample of size $N$, $k$ is chosen so that $Nk$
is as near to $M$ as possible.

Systematic  sampling  divides  the  population  in  effect  into  strata,  each 
consisting  of  k  successive  units,  and  chooses  one  sampling  unit  per  stratum.  The 
choice  is  not  random,  however,  since  the  unit  chosen  occupies  the  same  ordinal 
position  in  each  stratum.  Since  a  systematic  sample  is  spread  evenly  over  the 
population,  it  often  gives  a  good  estimate  of  the  mean,  although  the  variance  of 
the estimator is in most cases greater than that for a simple random sample.

If $X_{\alpha}$ is the variate measured on the $\alpha$th element, the sample mean for the $i$th
systematic sample is

(7.4.1)  $m_i = \frac{1}{N} \left[ X_i + X_{k+i} + \ldots + X_{(N-1)k+i} \right]$

Since there are $k$ values of $i$, all equally likely,

(7.4.2)  $E(m_i) = \frac{1}{k} \sum_{i=1}^{k} m_i = \frac{1}{Nk} \sum_{\alpha=1}^{Nk} X_{\alpha}$

If $Nk = M$, this expression is the population mean $\mu$, so that $m_i$ is an unbiased
estimator for $\mu$. Also,

(7.4.3)  $V(m_i) = E(m_i - \mu)^2$

Now by Eq. (1), $Nm_i - N\mu = (X_i - \mu) + (X_{k+i} - \mu) + \ldots + (X_{(N-1)k+i} - \mu)$,
so that

(7.4.4)  $\sum_{i=1}^{k} (Nm_i - N\mu)^2 = \sum_{\alpha=1}^{Nk} (X_{\alpha} - \mu)^2 + 2 \sum_{j=1}^{N-1} \sum_{\beta=1}^{k(N-j)} (X_{\beta} - \mu)(X_{\beta+jk} - \mu)$

The first term on the right-hand side is $(M - 1)\sigma^2$. The second term vanishes
if there is no correlation between pairs such as $X_{\beta}$ and $X_{\beta+jk}$, separated by $jk$
items. Correlation of this type is called serial correlation.
If the items are serially uncorrelated,

(7.4.5)  $V(m_i) = \frac{(M - 1)\sigma^2}{N^2 k} = \frac{M - 1}{M} \cdot \frac{\sigma^2}{N}$

The corresponding value for the variance of the mean $m_r$ of a random sample of
size $N$ is $(N^{-1} - M^{-1})\sigma^2$, so that

$\frac{V(m_i)}{V(m_r)} = \frac{M - 1}{M - N}$

This is greater than 1 for any $N > 1$. However, if there is a sufficiently large
negative serial correlation, $V(m_i)$ may be less than $V(m_r)$. See [4].
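
For a small population the two variances can be computed exactly by listing all $k$ possible systematic samples. The sketch below does this in Python, assuming $Nk = M$; the function name and the trending population $X_\alpha = \alpha$ are hypothetical choices for the illustration.

```python
from statistics import mean

def systematic_vs_random(x, k):
    """Exact variances of the sample mean: systematic (all k starts) vs simple random."""
    M = len(x)
    assert M % k == 0, "this sketch assumes Nk = M"
    N = M // k
    mu = mean(x)
    sigma2 = sum((v - mu) ** 2 for v in x) / (M - 1)    # divisor M - 1, as in the text
    means = [mean(x[i::k]) for i in range(k)]           # the k possible systematic samples
    v_sys = sum((m - mu) ** 2 for m in means) / k       # Eq. (7.4.3), all starts equally likely
    v_ran = (1 / N - 1 / M) * sigma2
    return v_sys, v_ran

# Hypothetical population with a steady trend: X_alpha = alpha, alpha = 1 ... 100.
v_sys, v_ran = systematic_vs_random(list(range(1, 101)), k=10)
```

In this ordered population the items are far from serially uncorrelated, and the systematic sample turns out much more precise than a random one (8.25 against 75.75), which illustrates how strongly the comparison depends on the ordering of the list.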

*  7.5  Double  Sampling  It  is  sometimes  useful  to  take  a  preliminary  large 
sample  in  order  to  get  some  information  which  will  serve  as  a  basis  for  drawing  a 
sub-sample  for  further  investigation,  particularly  if  the  large  sample  can  be 
obtained  rather  cheaply.  The  information  is  used  for  stratification,  or  in  other 
ways,  in  order  to  increase  the  precision  of  estimates  from  the  smaller  and  more 
costly  sub-sample. 

For instance, suppose we are concerned with a variable $T$, such as total sales
in all retail stores of a certain type, which may be rather hard to obtain. The
large stores of this type will be much more important in providing an estimate of
$T$ than the small ones, and we would like to have a much larger sampling fraction
of  the  larger  stores.  As  a  preliminary  to  the  sampling  design  we  might  make  a 
survey  of  a  large  fraction  of  the  population  of  stores,  obtaining  only  simple 
information  on  size  (say  the  number  of  employees),  and  use  this  to  decide  on  the 
sub-sample  which  will  be  investigated  to  determine  T. 

Suppose we classify the stores in the original sample of $N$ (from a population
of $M$) as large ($N_1$) or small ($N_2$). The corresponding numbers in the population
are $M_1$ and $M_2$. If the sub-sample consists of all the large stores and $n_2$ of the
small ones (sampling fraction $n_2/N_2 = 1/k$), an unbiased estimator of $T$ is

(7.5.1)  $\hat{T} = \frac{1}{f} (T_1 + kT_2)$

where $T_1$ is the total sales for the $N_1$ large stores and $T_2$ is the total for the $n_2$
small stores. Here $f$ is the primary sampling fraction $N/M$. The total sub-sample
size is $N_1 + n_2$. Equation (1) may be written

(7.5.2)  $\hat{T} = \frac{M}{N} \left[ \sum_{i=1}^{N_1} X_{1i} + \frac{N_2}{n_2} \sum_{j=1}^{n_2} X_{2j} \right]$

where $X_{1i}$ is the sales figure for the $i$th store in the large group and $X_{2j}$ that for
the $j$th store in the sub-sample of the small group.

The expectation of $\hat{T}$ for a fixed $n_2$, and for a fixed set of $N_2$ units from which
the sub-sample of size $n_2$ is picked, is given by

(7.5.3)  $E(\hat{T} | n_2, N_2) = \frac{M}{N} \left[ \sum_{i=1}^{N_1} X_{1i} + \sum_{i=1}^{N_2} X_{2i} \right] = \frac{M}{N} \sum_{i=1}^{N} X_i$

where $X_i$ is now the value for the $i$th store in the sample, regardless of whether
it belongs to the one group or the other. Then

(7.5.4)  $E(\hat{T}) = E[E(\hat{T} | n_2, N_2)] = \sum_{i=1}^{M} X_i = T$

The variance of $\hat{T}$, as shown below, is given by

(7.5.5)  $V(\hat{T}) = \frac{M}{N} \left[ (M - N)\sigma^2 + (k - 1) M_2 \sigma_2^2 \right]$

where

$\sigma^2 = (M - 1)^{-1} \sum_{i=1}^{M} (X_i - \mu)^2, \qquad \mu = \frac{T}{M}$

and

$\sigma_2^2 = (M_2 - 1)^{-1} \sum_{i=1}^{M_2} (X_{2i} - \mu_2)^2, \qquad \mu_2 = \sum_{i=1}^{M_2} \frac{X_{2i}}{M_2}$

To prove this we need Eq. (A.14.6) of the Appendix, namely,

$V(X) = E[V(X|Y)] + V[E(X|Y)]$

where $Y$ stands for a fixed set of $N_2$ small stores and the fixed number $n_2$. The
second term on the right is just the variance of the right-hand side of Eq. (3),
that is of $M\bar{X}$. This is $M^2 \left( \frac{1}{N} - \frac{1}{M} \right) \sigma^2$, which is the first term of Eq. (5).

The conditional variance of $\hat{T}$ is

(7.5.6)  $V(\hat{T} | n_2, N_2) = \frac{N_2^2}{f^2 n_2} \left( 1 - \frac{1}{k} \right) s_2^2 = \frac{N_2}{f^2} (k - 1) s_2^2$

where $s_2^2 = (N_2 - 1)^{-1} \sum_{j=1}^{N_2} (X_{2j} - \bar{X}_2)^2$. This follows from Eq. (2), since the
first term is constant (under the stated condition) and the second term is
$(MN_2)/N$ times the sub-sample mean $\bar{X}_2$.

The expectation of Eq. (6) for a given $n_2$ and given $k$ (i.e., for given number
$N_2$ of small units although not for a fixed set of $N_2$) is $(N_2/f^2)(k - 1)\sigma_2^2$, and
the expectation of this for given $k$ is

$\frac{k - 1}{f^2} \sigma_2^2 E(N_2) = \frac{k - 1}{f^2} \sigma_2^2 \frac{NM_2}{M} = (k - 1) \frac{M}{N} M_2 \sigma_2^2$

which is the second term of Eq. (5).

The  optimum  allocation  of  sample  sizes  will  depend  on  the  relative  costs  of 
the  first  and  second  sample.  These  often  differ  quite  considerably.  The  original 
large  sample  may,  for  instance,  be  obtained  by  a  mailed  questionnaire,  and  a 
sub-sample  of  the  non-responders  may  be  followed  up  with  personal  interviews, 
which  are  considerably  more  expensive.  As  a  cost  function  (apart  from  fixed 
overhead  costs)  for  the  whole  survey,  we  might  assume 

(7.5.7)  $C = NC_0 + N_1 C_1 + n_2 C_2$

where $C_0$ is the cost of selecting and examining a unit in the large sample, $C_1$
is the additional unit cost for the large units, and $C_2$ is the additional unit cost
for sub-sampling the small units. We want to find the optimum values of $N$ and
$k$ ($= N_2/n_2$) for fixed variance and minimum cost. From Eq. (7),

(7.5.8)  $E(C) = NC_0 + \frac{N}{M} M_1 C_1 + \frac{N}{kM} M_2 C_2$

We wish to minimize this, subject to a fixed value, say $e^2$, for the variance of the
estimate $\hat{T}$. The method is to use a Lagrange undetermined multiplier $\lambda$ (see
Appendix A.15) and form the function

(7.5.9)  $F(N, k, \lambda) = E(C) + \lambda [V(\hat{T}) - e^2]$

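Setting $\partial F/\partial N$ and $\partial F/\partial k$ equal to zero, we get

(7.5.10)  $C_0 + \frac{M_1}{M} C_1 + \frac{M_2}{kM} C_2 - \lambda \left[ \frac{M^2 \sigma^2}{N^2} + \frac{M M_2 (k - 1) \sigma_2^2}{N^2} \right] = 0$

and

$-\frac{N M_2}{k^2 M} C_2 + \lambda \frac{M M_2 \sigma_2^2}{N} = 0$

Eliminating $\lambda/N^2$ from these equations, we get

(7.5.11)  $k^2 = \frac{C_2 M \sigma^2 - C_2 M_2 \sigma_2^2}{C_0 M \sigma_2^2 + C_1 M_1 \sigma_2^2} = \frac{M C_2}{M_1} \cdot \frac{\sigma^2/\sigma_2^2 - M_2/M}{C_1 + C_0 M/M_1}$

If we put $V(\hat{T}) = e^2$ in Eq. (5), we get an expression for $N$, namely,

(7.5.12)  $N = \frac{M^2 \sigma^2}{M\sigma^2 + e^2} \left[ 1 + \frac{(k - 1) M_2 \sigma_2^2}{M\sigma^2} \right]$

where $k$ is given by Eq. (11).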

If we took a simple random sample of $N'$ from the population of $M$, large
enough to give the same variance for $\hat{T}$ (now $M\bar{X}$), we should have

$e^2 = V(M\bar{X}) = M^2 V(\bar{X}) = M^2 \sigma^2 \left( \frac{1}{N'} - \frac{1}{M} \right)$

so that

(7.5.13)  $N' = \frac{M^2 \sigma^2}{M\sigma^2 + e^2}$

This is the first factor in the expression for $N$, Eq. (12). However, although $N$
is larger than $N'$, the cost of the double sample is less than that of the
proportionate single sample.

Example 1 Suppose that

$M = 20{,}000, \qquad M_1 = 1{,}000, \qquad (M_2 = 19{,}000)$
$\sigma_1^2 = 500, \qquad \sigma_2^2 = 5, \qquad \sigma^2 = 34$

(the variances are estimates from a preliminary investigation), and

$C_0 = 0.25, \qquad C_1 = 2, \qquad C_2 = 1$ (dollars)

Suppose the preliminary estimate of $T$ is 29,000 units, and we want $e$ to be not
more than 0.04 of this, or 1160. Then by Eq. (13),

$N' = 6710$

The cost of a single proportionate sample of this size would be
$6710C_0 + 335C_1 + 6375C_2 = \$8723$.

From Eqs. (11) and (12) we find $k^2 = 16.7$, so that $k \approx 4$, and $N = 9520$.
This gives $N_1 = 476$, $n_2 = \frac{1}{4}(9520 - 476) = 2261$. The cost of the double
sampling method would therefore be $9520C_0 + 476C_1 + 2261C_2 = \$5593$.

This  is  considerably  cheaper  than  the  cost  of  a  single  sample  to  give  the  same 
precision. 
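
The arithmetic of Example 1 can be retraced with a short Python sketch. The variable names are hypothetical, and the figures it produces differ slightly from those quoted above because the text rounds its intermediate results to three significant figures.

```python
from math import sqrt

M, M1, M2 = 20_000, 1_000, 19_000
var1, var2, var = 500, 5, 34          # sigma_1^2, sigma_2^2, sigma^2
C0, C1, C2 = 0.25, 2, 1               # unit costs in dollars
e = 0.04 * 29_000                     # permitted standard error, 1160

# Eq. (7.5.13): size of a simple random sample giving variance e^2
n_single = M**2 * var / (M * var + e**2)

# Eq. (7.5.11): optimum k^2; the text rounds k to the integer 4
k_squared = (C2 * M * var - C2 * M2 * var2) / (C0 * M * var2 + C1 * M1 * var2)
k = round(sqrt(k_squared))

# Eq. (7.5.12): first-phase sample size for the double sample
n_double = n_single * (1 + (k - 1) * M2 * var2 / (M * var))

def expected_cost(N, k):
    """Eq. (7.5.8): expected cost for first-phase size N and ratio k."""
    return N * C0 + (N / M) * M1 * C1 + (N / (k * M)) * M2 * C2

cost_double = expected_cost(n_double, k)
# A single proportionate sample examines every small store drawn, i.e. k = 1.
cost_single = expected_cost(n_single, 1)
```

Carried to full precision the sketch gives roughly $N' = 6714$, $N = 9528$ and costs near 8728 and 5598 dollars, in agreement with the rounded values of the example.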

7.6  Sequential  Sampling  In  any  fixed-size  sampling  procedure  the  total 
number  of  items  in  the  sample  is  decided  beforehand  and  this  number  of  items 
is  drawn  and  examined.  However,  it  is  sometimes  practicable  and  economical 
to  draw  the  sample  items  one  at  a  time  and  examine  them  as  they  are  drawn. 
This  type  of  sampling  is  called  sequential. 

Suppose a certain hypothesis $H_0$ regarding the parent population is to be
tested (for instance, the hypothesis that the proportion of defectives in a large
batch of machine parts is not greater than $p_0$). On the basis of the first $m$ sample
items tested we may make one of three decisions: (a) to accept $H_0$, (b) to reject
H0,  (c)  to  test  one  more  item.  The  process  is  terminated  when  our  decision  rule 
leads  us  to  either  (a)  or  (b).  The  expected  number  of  observations  required  to 
reach  one  of  these  two  decisions  is  less  than  we  would  need  in  order  to  make  the 
same  decision  on  the  basis  of  a  single  fixed-size  sample.  Of  course,  it  may 
happen  that  the  sequential  procedure  will  take  more  observations  than  the 
fixed-size  one  (although  this  is  unlikely)  and  it  may  not  always  be  convenient  in 
practice  to  take  the  samples  one  at  a  time,  but,  by  and  large,  sequential  sampling 
is  a  definitely  economical  procedure.  For  full  details,  Wald's  book  [5]  should  be 
consulted. 

Sequential  testing  may  be  illustrated  by  the  theory  of  the  random  walk. 
Suppose  B,  O,  A  are  three  points  on  a  straight  line,  where  A  is  a  paces  to  the 
right  of  O  and  B  is  b  paces  to  the  left.  If  I  start  at  O  and  take  one  pace  per 
second  in  a  random  direction  (backwards  or  forwards)  along  the  line,  how  long 
will it take me to reach either A or B? This is the random walk problem in a
simple  form.  It  can  be  proved  that  the  walk  will  eventually  terminate.  The 
probability  of  oscillating  back  and  forth  without  ever  reaching  either  A  or  B  is 
zero.  In  the  sequential  decision  process  each  new  item  tested  is  like  a  pace  in  the 
random  walk — it  leads  towards  decision  (a)  or  decision  (b).  Eventually  one  of 
these  two  decisions  will  be  actually  reached. 

The type of sequential test suggested by Wald is a likelihood ratio test. Let us
suppose that we are testing a simple null hypothesis $H_0$ against a simple alternative
$H_1$ (see § 6.7). Let $f_0(x_1)$ be the probability (or probability density) that the
variable $X$ takes the value $x_1$ when $H_0$ is true, and similarly
$f_1(x_1) = P(X = x_1 | H_1)$. The joint likelihood of the given set of $m$ observations
$x_1, x_2, \ldots x_m$ under $H_0$ is

$p_{0m} = f_0(x_1) \cdot f_0(x_2) \cdots f_0(x_m)$

and the joint likelihood under $H_1$ is

$p_{1m} = f_1(x_1) \cdot f_1(x_2) \cdots f_1(x_m)$

The test suggested is to calculate $p_{1m}/p_{0m}$ for each successive value of $m$ and to
continue testing as long as the ratio lies between two specified limits $A$ and $B$
($A > B$). The process is terminated when for the first time either $p_{1m}/p_{0m} \ge A$ or
$p_{1m}/p_{0m} \le B$. In the first case $H_0$ is rejected, and in the second case it is accepted.
If we put $z_i = \log f_1(x_i) - \log f_0(x_i)$, we have

(7.6.1)  $\log \left( \frac{p_{1m}}{p_{0m}} \right) = \sum_{i=1}^{m} \log \left[ \frac{f_1(x_i)}{f_0(x_i)} \right] = \sum_{i=1}^{m} z_i = Z_m$

and the test is terminated as soon as $Z_m \ge \log A$ or $Z_m \le \log B$. Since $Z_m$ is a
sum of the $m$ random variables $z_i$, the analogy with a random walk is clear.

The values of $A$ and $B$ are determined by the risks we are prepared to take in
coming to the one decision or the other. If the probability of a rejection error
(an error of the first kind) is $\alpha$ and the probability of an acceptance error (second
kind) is $\beta$, and if $n$ observations lead to the rejection of $H_0$, then
$p_{1n}/p_{0n} = (1 - \beta)/\alpha$, since this is the ratio of the probabilities of $H_1$ and $H_0$ for a sample
which leads to the rejection of $H_0$. Therefore,

(7.6.2)  $\frac{1 - \beta}{\alpha} \ge A$

Similarly,

(7.6.3)  $\frac{\beta}{1 - \alpha} \le B$

In practice we usually take $A = (1 - \beta)/\alpha$ and $B = \beta/(1 - \alpha)$. If the distribution
is such that one extra observation will make little difference to the value of
$p_{1n}/p_{0n}$, there will be no appreciable error in doing this.
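
The decision rule of this section may be sketched in Python as follows, taking $A = (1 - \beta)/\alpha$ and $B = \beta/(1 - \alpha)$ and treating the boundaries as inclusive. The function name `sprt`, its arguments, and the observations in the illustration are hypothetical.

```python
from math import log

def sprt(observations, log_f0, log_f1, alpha, beta):
    """Wald's sequential test with A = (1 - beta)/alpha and B = beta/(1 - alpha).

    Returns (decision, m); decision is "reject H0", "accept H0", or
    "continue" if the observations run out before a boundary is reached.
    """
    log_a = log((1 - beta) / alpha)
    log_b = log(beta / (1 - alpha))
    z_sum = 0.0
    m = 0
    for m, x in enumerate(observations, start=1):
        z_sum += log_f1(x) - log_f0(x)        # z_i, so that z_sum is Z_m
        if z_sum >= log_a:
            return "reject H0", m
        if z_sum <= log_b:
            return "accept H0", m
    return "continue", m

# Hypothetical illustration: unit-variance normal, mean 0 under H0 and 1 under H1.
log_f0 = lambda x: -0.5 * x ** 2              # constant terms cancel in z_i
log_f1 = lambda x: -0.5 * (x - 1.0) ** 2
decision, m = sprt([1.2, 0.9, 1.5, 1.1, 0.8, 1.3], log_f0, log_f1, 0.05, 0.1)
```

Here the cumulative sums $Z_m$ are 0.7, 1.1, 2.1, 2.7, 3.0, and the fifth first exceeds $\log 18 \approx 2.89$, so $H_0$ is rejected after five observations.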

* 7.7 Number of Observations Required for a Final Decision in Sequential
Sampling Let $n$ be the smallest integer for which $Z_n \ge \log A$ or $Z_n \le \log B$.
We would like to find the expected value of $n$ and compare it with the fixed
sample size $N$ which would give the same probabilities $\alpha$ and $\beta$ of error.
Since $Z_n = z_1 + z_2 + \ldots + z_n$, and $n$ is a random variable,

(7.7.1)  $E(Z_n) = E[E(Z_n | n)] = E[nE(z)] = E(n) \cdot E(z)$

where $E(z)$ is the expected value of any of the $z_i$.

If the test leads to the rejection of $H_0$, $E(Z_n)$ will be nearly $\log A$, and, if the
test leads to the acceptance of $H_0$, $E(Z_n)$ will be nearly $\log B$, so that

(7.7.2)  $E(Z_n) \approx \gamma \log A + (1 - \gamma) \log B$

where $\gamma$ is the probability of rejecting $H_0$ ($\gamma$ will be $\alpha$ if $H_0$ is true or $1 - \beta$ if
$H_1$ is true). From Eqs. (1) and (2),

(7.7.3)  $E(n) \approx \frac{\gamma \log A + (1 - \gamma) \log B}{E(z)}$

Example 2 Suppose the variate $X$ is normally distributed with unit variance,
and that under $H_0$ the mean is $\mu_0$ and under $H_1$ it is $\mu_1$ ($> \mu_0$). Then

(7.7.4)  $f_0(x) = (2\pi)^{-1/2} e^{-(x - \mu_0)^2/2}, \qquad f_1(x) = (2\pi)^{-1/2} e^{-(x - \mu_1)^2/2}$

Let $E_0(n)$, $E_1(n)$ be the expected values of $n$ under $H_0$ and $H_1$ respectively.
Then

(7.7.5)  $E_0(n) = \frac{\alpha \log[(1 - \beta)/\alpha] + (1 - \alpha) \log[\beta/(1 - \alpha)]}{E_0(z)}$

         $E_1(n) = \frac{(1 - \beta) \log[(1 - \beta)/\alpha] + \beta \log[\beta/(1 - \alpha)]}{E_1(z)}$

From Eq. (4),

$z = \log f_1(x) - \log f_0(x) = -\frac{\mu_1^2 - \mu_0^2}{2} + x(\mu_1 - \mu_0)$

Under $H_0$, $E(x) = \mu_0$, so that

(7.7.6)  $E_0(z) = -\frac{\mu_1^2 - \mu_0^2}{2} + \mu_0(\mu_1 - \mu_0) = -\tfrac{1}{2}(\mu_1 - \mu_0)^2$

and similarly, $E_1(z) = \tfrac{1}{2}(\mu_1 - \mu_0)^2$. Therefore, from Eq. (5), $E_0(n)$ and $E_1(n)$
may be found.

Now if $N$ is the fixed sample size corresponding to the same values of $\alpha$ and
$\beta$, and we use the statistic $\bar{X}$ (which is normally distributed with mean $\mu_0$ and
variance $1/N$ when $H_0$ is true),

$\alpha = (2\pi)^{-1/2} \int_{\lambda_0}^{\infty} e^{-t^2/2} \, dt, \qquad \lambda_0 = \sqrt{N}(c - \mu_0)$

where $c$ is the critical value for $\bar{X}$. Similarly,

$1 - \beta = (2\pi)^{-1/2} \int_{\lambda_1}^{\infty} e^{-t^2/2} \, dt, \qquad \lambda_1 = \sqrt{N}(c - \mu_1)$

Therefore $\alpha = 1 - \Phi(\lambda_0)$, $\beta = \Phi(\lambda_1)$ and

(7.7.7)  $N = \frac{(\lambda_0 - \lambda_1)^2}{(\mu_1 - \mu_0)^2}$

From Eqs. (5), (6) and (7) it appears that $E_0(n)/N$ and $E_1(n)/N$ are independent
of $(\mu_1 - \mu_0)^2$ and so may be calculated for any given $\alpha$ and $\beta$. Thus for $\alpha = 0.05$,
$\beta = 0.1$, we have

$\lambda_0 = 1.645, \qquad \lambda_1 = -1.282, \qquad (\lambda_0 - \lambda_1)^2 = 8.57$

$\frac{E_0(n)}{N} = \frac{0.05 \log 18 + 0.95 \log 0.1053}{-\frac{1}{2}(8.57)} = 0.465$

$\frac{E_1(n)}{N} = \frac{0.9 \log 18 + 0.1 \log 0.1053}{\frac{1}{2}(8.57)} = 0.555$

There is an expected saving of 53.5% if $H_0$ is true or 44.5% if $H_1$ is true, in the
number of observations required.
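
The two ratios just obtained can be checked numerically. The Python sketch below is only an illustration for the unit-variance normal case of Example 2; the function name is hypothetical, and the normal quantiles are taken from the standard library's `statistics.NormalDist` rather than from tables.

```python
from math import log
from statistics import NormalDist

def expected_size_ratios(alpha, beta):
    """Return (E0(n)/N, E1(n)/N) for the unit-variance normal case of Example 2."""
    log_a = log((1 - beta) / alpha)
    log_b = log(beta / (1 - alpha))
    std_normal = NormalDist()
    lam0 = std_normal.inv_cdf(1 - alpha)      # alpha = 1 - Phi(lambda_0)
    lam1 = std_normal.inv_cdf(beta)           # beta = Phi(lambda_1)
    half_sq = 0.5 * (lam0 - lam1) ** 2        # N (mu_1 - mu_0)^2 / 2, by Eq. (7.7.7)
    ratio0 = (alpha * log_a + (1 - alpha) * log_b) / (-half_sq)
    ratio1 = ((1 - beta) * log_a + beta * log_b) / half_sq
    return ratio0, ratio1

r0, r1 = expected_size_ratios(0.05, 0.1)
```

With unrounded quantiles the ratios come out near 0.466 and 0.555, in agreement with the values above to within table rounding.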

A lower limit may be calculated for the probability that the sequential
process will terminate before $n$ reaches some preassigned number $n_0$. If this
probability is $P_0(n_0)$ under $H_0$ and $P_1(n_0)$ under $H_1$, then it may be shown that

$P_0(n_0) \ge \Phi(\delta_0), \qquad P_1(n_0) \ge 1 - \Phi(\delta_1)$

where

$\delta_0 = \frac{\log B - n_0 E_0(z)}{\sqrt{n_0}\, \sigma_0(z)}, \qquad \delta_1 = \frac{\log A - n_0 E_1(z)}{\sqrt{n_0}\, \sigma_1(z)}$

and $\sigma_0(z)$, $\sigma_1(z)$ are the standard deviations of $z$ under hypotheses $H_0$ and $H_1$.

In Example 2 above, $\sigma_0(z) = \sigma_1(z) = \mu_1 - \mu_0$ since the standard deviation
of $X$ is 1. Therefore,

$\delta_0 = \frac{\log 0.1053 + \frac{n_0}{2}(\mu_1 - \mu_0)^2}{\sqrt{n_0}(\mu_1 - \mu_0)}, \qquad \delta_1 = \frac{\log 18 - \frac{n_0}{2}(\mu_1 - \mu_0)^2}{\sqrt{n_0}(\mu_1 - \mu_0)}$

With $\alpha = 0.05$ and $\beta = 0.1$ and a fixed sample size of 1000, we could detect
a difference $\mu_1 - \mu_0$ amounting to $2.927/\sqrt{1000} = 0.0926$ (see § 6.10).

With $n_0 = 1000$, $\delta_0 = 0.694$, $\delta_1 = -0.476$, so that $P_0(n_0) \ge 0.756$, and
$P_1(n_0) \ge 0.683$. The probability is therefore at least 0.68 that a sequential test
of this kind will terminate in a decision for acceptance or rejection of $H_0$ before
the sample size reaches 1000.
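
The lower bounds for Example 2 can be reproduced with the following Python sketch, which assumes the unit-variance normal case throughout. The function name and argument names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

def termination_bounds(alpha, beta, delta_mu, n0):
    """Lower bounds on P(test ends by n0) under H0 and H1, unit-variance normal case."""
    log_a = log((1 - beta) / alpha)
    log_b = log(beta / (1 - alpha))
    e1_z = 0.5 * delta_mu ** 2                # E_1(z); E_0(z) is its negative
    sd_z = delta_mu                           # sigma_0(z) = sigma_1(z) = mu_1 - mu_0
    d0 = (log_b + n0 * e1_z) / (sqrt(n0) * sd_z)
    d1 = (log_a - n0 * e1_z) / (sqrt(n0) * sd_z)
    phi = NormalDist().cdf
    return d0, d1, phi(d0), 1 - phi(d1)

d0, d1, p0_bound, p1_bound = termination_bounds(0.05, 0.1, 2.927 / sqrt(1000), 1000)
```

The four returned values correspond to $\delta_0$, $\delta_1$ and the bounds on $P_0(n_0)$ and $P_1(n_0)$ quoted in the text.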

7.8 The Truncated Sequential Test If it happens that the test is still not
terminated for some $n_0$ beyond which it is not convenient to continue testing, a
reasonable decision rule is the following:

If $Z_{n_0} \le 0$, accept $H_0$; if $Z_{n_0} > 0$, accept $H_1$.

The probabilities of the two kinds of error, $\alpha(n_0)$ and $\beta(n_0)$ under this rule
are slightly different from $\alpha$ and $\beta$. It may be shown that

(7.8.1)  $\alpha(n_0) \le \alpha - \Phi(\nu_1) + \Phi(\nu_2)$

         $\beta(n_0) \le \beta + \Phi(\nu_3) - \Phi(\nu_4)$

where

$\nu_1 = \delta_0 - \frac{\log B}{\sqrt{n_0}\, \sigma_0(z)} = -\sqrt{n_0}\, E_0(z)/\sigma_0(z), \qquad \nu_2 = \nu_1 + \frac{\log A}{\sqrt{n_0}\, \sigma_0(z)}$

$\nu_3 = \delta_1 - \frac{\log A}{\sqrt{n_0}\, \sigma_1(z)} = -\sqrt{n_0}\, E_1(z)/\sigma_1(z), \qquad \nu_4 = \nu_3 + \frac{\log B}{\sqrt{n_0}\, \sigma_1(z)}$

These are upper bounds and probably higher than necessary. For $n_0 = 1000$
and the data of Example 2, $\nu_1 = -\nu_3 = 1.464$, $\nu_2 = 2.451$, $\nu_4 = -2.233$.
These values give $\alpha(n_0) \le 0.114$, $\beta(n_0) \le 0.159$.

If $N = 100$, and we decide to stop at $n_0 = 300$, whatever happens, the
upper bounds for $\alpha(n_0)$ and $\beta(n_0)$ are 0.052 when $\alpha = \beta = 0.05$, so that
truncating in this way would make very little difference to the probabilities of error.
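
The bounds of Eq. (7.8.1) are easily evaluated for the unit-variance normal case of Example 2, where $\nu_1 = -\nu_3 = \frac{1}{2}\sqrt{n_0}(\mu_1 - \mu_0)$. The Python sketch below is an illustration under that assumption; the function name is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

def truncation_bounds(alpha, beta, delta_mu, n0):
    """Upper bounds of Eq. (7.8.1) for the unit-variance normal case of Example 2."""
    log_a = log((1 - beta) / alpha)
    log_b = log(beta / (1 - alpha))
    scale = sqrt(n0) * delta_mu               # sqrt(n0) sigma(z), with sigma(z) = mu_1 - mu_0
    nu1 = 0.5 * scale                         # -sqrt(n0) E_0(z) / sigma_0(z)
    nu2 = nu1 + log_a / scale
    nu3 = -0.5 * scale                        # -sqrt(n0) E_1(z) / sigma_1(z)
    nu4 = nu3 + log_b / scale
    phi = NormalDist().cdf
    return alpha - phi(nu1) + phi(nu2), beta + phi(nu3) - phi(nu4)

# Example 2 with n0 = 1000, then the case N = 100, n0 = 300, alpha = beta = 0.05.
a_bound, b_bound = truncation_bounds(0.05, 0.1, 2.927 / sqrt(1000), 1000)
a_300, b_300 = truncation_bounds(0.05, 0.05, 3.29 / sqrt(100), 300)
```

The second call uses $\mu_1 - \mu_0 = 3.29/\sqrt{100}$, the difference detectable with $N = 100$ when $\lambda_0 = -\lambda_1 = 1.645$.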

7.9  The  Sequential  Test  for  a  Binomial  Distribution  We  assume  that  the 
objects  in  a  large  group  (a  "lot"  in  the  language  of  sampling)  can  be  classified 
as  either  "defective"  or  "satisfactory".  A  lot  will  be  acceptable  if  the  proportion 
of defectives $\theta \le \theta'$, but otherwise the buyer will want to refuse it. It is supposed
that the buyer has to make a decision on the basis of a sample, and therefore he
may make either of the two kinds of error we have discussed previously. He may
refuse a good lot or accept a bad one, and must decide what risks he is prepared
to run of making either of these mistakes. Suppose he decides that it would be a
serious matter to refuse a lot with $\theta \le \theta_0$ and that it would be unfortunate to
accept a lot with $\theta \ge \theta_1$, where of course $\theta_0 < \theta' < \theta_1$. He will want to keep
the probability of committing these serious errors down below say $\alpha$ and $\beta$,
respectively, where both these numbers are fairly small compared with 1. Having
decided on $\theta_0$, $\theta_1$, $\alpha$ and $\beta$, he can construct a sequential test.

Randomly selected items from the lot are taken one at a time and inspected.
Suppose the number of defectives in the first $m$ units tested is $d_m$. Then

(7.9.1)  $\frac{p_{1m}}{p_{0m}} = \frac{\theta_1^{d_m} (1 - \theta_1)^{m - d_m}}{\theta_0^{d_m} (1 - \theta_0)^{m - d_m}}$

will give a likelihood ratio test for hypothesis $H_0$ (that $\theta = \theta_0$) against hypothesis
$H_1$ (that $\theta = \theta_1$). If $\theta < \theta_0$ the probability of rejecting the lot will be even less
than for $\theta = \theta_0$, and if $\theta > \theta_1$ the probability of accepting the lot will be less than
for $\theta = \theta_1$. The same test may therefore be regarded as a test for the composite
hypothesis $\theta \le \theta_0$ against the composite hypothesis $\theta \ge \theta_1$.


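The test consists in rejecting $H_0$ if $Z_m = \log\left(\dfrac{p_{1m}}{p_{0m}}\right) > \log\dfrac{1-\beta}{\alpha}$, accepting
$H_0$ if $Z_m < \log\dfrac{\beta}{1-\alpha}$, and continuing the test if neither of these is true. This is
equivalent to setting up an acceptance number $A_m$ and a rejection number $R_m$
for each value of $m$ and continuing the test as long as $A_m < d_m < R_m$. The
numbers $A_m$ and $R_m$ are given by

(7.9.2)  $A_m = \dfrac{\log\dfrac{\beta}{1-\alpha} + m\log\dfrac{1-\theta_0}{1-\theta_1}}{\log\dfrac{\theta_1}{\theta_0} + \log\dfrac{1-\theta_0}{1-\theta_1}}$

(7.9.3)  $R_m = \dfrac{\log\dfrac{1-\beta}{\alpha} + m\log\dfrac{1-\theta_0}{1-\theta_1}}{\log\dfrac{\theta_1}{\theta_0} + \log\dfrac{1-\theta_0}{1-\theta_1}}$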

Since $A_m$ and $R_m$ depend linearly on $m$, they define a sloping band of constant
width on a diagram with $d_m$ plotted against $m$. For $\alpha = \beta = 0.05$, $\theta_0 = 0.001$
and $\theta_1 = 0.03$, the lines representing $A_m$ and $R_m$ as functions of $m$ are

(7.9.4)  $A_m = 0.00859m - 0.858$
         $R_m = 0.00859m + 0.858$


Fig. 38    Sequential binomial test

In Figure 38 an imaginary sampling experiment is represented by the stepped
line.  In  the  first  81  items  tested  there  were  no  defectives,  the  82nd  was  defective, 
the  second  defective  turned  up  at  the  152nd  test,  the  third  at  the  221st.  This  last 
test  took  the  cumulative  polygon  outside  the  rejection  line  and  therefore  the  lot 
was  refused. 

A  lot  under  this  scheme  will  be  accepted  if  the  first  100  items  tested  show  no 
defectives  and  will  be  refused  if  a  defective  appears  in  the  first  17.  If  only  one 
defective  has  appeared  in  212,  the  lot  will  be  accepted;  if  two  appear  in  the  first 
132  it  will  be  refused;  and  so  on. 
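
The boundary lines and the stepped path of Figure 38 can be reproduced numerically. The following is a minimal sketch, not part of the original text: the function names `sequential_lines` and `run_sequential_test` are illustrative choices, natural logarithms are assumed (any base gives the same lines), and a decision is taken as soon as $d_m$ reaches or crosses a boundary.

```python
import math

def sequential_lines(theta0, theta1, alpha, beta):
    """Slope and intercepts of the lines A_m and R_m of Eqs. (7.9.2)-(7.9.3)."""
    g = math.log((1 - theta0) / (1 - theta1))
    denom = math.log(theta1 / theta0) + g
    slope = g / denom
    accept_intercept = math.log(beta / (1 - alpha)) / denom
    reject_intercept = math.log((1 - beta) / alpha) / denom
    return slope, accept_intercept, reject_intercept

def run_sequential_test(defective_positions, lines, max_items):
    """Inspect items one at a time; stop when d_m leaves the band A_m < d_m < R_m."""
    slope, a0, r0 = lines
    defective = set(defective_positions)
    d = 0
    for m in range(1, max_items + 1):
        if m in defective:
            d += 1
        if d <= slope * m + a0:
            return "accept", m
        if d >= slope * m + r0:
            return "reject", m
    return "undecided", max_items

lines = sequential_lines(theta0=0.001, theta1=0.03, alpha=0.05, beta=0.05)
print("slope, accept intercept, reject intercept:", lines)
# Defectives at the 82nd, 152nd and 221st items, as in the experiment of Figure 38.
print(run_sequential_test([82, 152, 221], lines, max_items=300))
```

The same two functions can be used to build the chart for any other choice of $\theta_0$, $\theta_1$, $\alpha$ and $\beta$.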

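The probability $P_\theta$ of accepting the lot for any given $\theta$ can be expressed as a
function of $\theta$. It decreases from 1, when $\theta = 0$, to 0 when $\theta = 1$. When $\theta = \theta_0$,
$P_\theta = 1 - \alpha$ and when $\theta = \theta_1$, $P_\theta = \beta$. If $\theta_0$ and $\theta_1$ are not too far apart, the
approximate value of $P_\theta$ is given by

(7.9.5)  $P_\theta \approx (A^h - 1)/(A^h - B^h)$

where $A = (1 - \beta)/\alpha$, $B = \beta/(1 - \alpha)$ and $h$ is the non-zero root of

(7.9.6)  $\theta\left(\dfrac{\theta_1}{\theta_0}\right)^h + (1-\theta)\left(\dfrac{1-\theta_1}{1-\theta_0}\right)^h = 1$

that is, of

(7.9.7)  $\theta = \dfrac{1 - \left(\dfrac{1-\theta_1}{1-\theta_0}\right)^h}{\left(\dfrac{\theta_1}{\theta_0}\right)^h - \left(\dfrac{1-\theta_1}{1-\theta_0}\right)^h}$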

By choosing various values of $h$, we can calculate corresponding values of
$\theta$ and $P_\theta$ and plot the curve. This is a sort of operating characteristic or power
curve of the test. It indicates the probability of accepting a lot with any given
proportion of defectives. With the data assumed above there is an even chance
of accepting a lot with $\theta = 0.009$.
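
The parametric construction of the operating characteristic can be sketched in a few lines. This is an illustration under stated assumptions rather than the book's own computation: `oc_point` is a hypothetical helper name, and the formulas used are Eq. (7.9.5) for $P_\theta$ and Eq. (7.9.7) for $\theta$, with $h$ taken as any non-zero real number.

```python
def oc_point(h, theta0, theta1, alpha, beta):
    """Return (theta, P_theta) for a non-zero auxiliary value h."""
    a = (1 - beta) / alpha          # A of Eq. (7.9.5)
    b = beta / (1 - alpha)          # B of Eq. (7.9.5)
    u = (theta1 / theta0) ** h
    v = ((1 - theta1) / (1 - theta0)) ** h
    theta = (1 - v) / (u - v)       # Eq. (7.9.7)
    p_accept = (a ** h - 1) / (a ** h - b ** h)
    return theta, p_accept

# Sweep h to trace the curve for the data of Figure 38.
for h in (2.0, 1.0, 0.5, 0.001, -0.5, -1.0, -2.0):
    theta, p = oc_point(h, theta0=0.001, theta1=0.03, alpha=0.05, beta=0.05)
    print(f"h = {h:6.3f}   theta = {theta:.5f}   P_theta = {p:.4f}")
```

Taking $h = 1$ and $h = -1$ recovers the two design points $(\theta_0, 1-\alpha)$ and $(\theta_1, \beta)$, and $h$ near zero gives the even-chance point.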

The expected number of observations $n$ necessary to reach a final decision,
one way or the other, is given approximately by

(7.9.8)  $E_\theta(n) \approx \dfrac{P_\theta \log B + (1 - P_\theta)\log A}{\theta\log\dfrac{\theta_1}{\theta_0} + (1-\theta)\log\dfrac{1-\theta_1}{1-\theta_0}}$

As a function of $\theta$, this starts at 100 for $\theta = 0$, rises slightly and then decreases
as $\theta$ increases, becoming 1 for $\theta = 1$. For $\theta = 0.02$, $E(n) = 53$ and for $\theta = 0.03$,
$E(n) = 36$ (using the data of Figure 38). The function is indeterminate at
$P_\theta = 0.5$ ($\theta = 0.0086$).
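
The quoted values of $E(n)$ can be checked at the points where $P_\theta$ is known exactly. The sketch below assumes that Eq. (7.9.8) has the usual Wald form, namely $P_\theta\log B + (1-P_\theta)\log A$ divided by the expected increment of the log likelihood ratio per item; the function name `expected_sample_size` is illustrative.

```python
import math

def expected_sample_size(theta, p_accept, theta0, theta1, alpha, beta):
    """Approximate E(n) for a lot with defective proportion theta.

    p_accept is the acceptance probability P_theta at that theta.
    """
    log_a = math.log((1 - beta) / alpha)
    log_b = math.log(beta / (1 - alpha))
    # Expected change in the log likelihood ratio from inspecting one item.
    step = (theta * math.log(theta1 / theta0)
            + (1 - theta) * math.log((1 - theta1) / (1 - theta0)))
    return (p_accept * log_b + (1 - p_accept) * log_a) / step

design = dict(theta0=0.001, theta1=0.03, alpha=0.05, beta=0.05)
# (theta, P_theta) pairs at which P_theta is known without solving for h.
for theta, p in ((0.0, 1.0), (0.001, 0.95), (0.03, 0.05), (1.0, 0.0)):
    print(theta, round(expected_sample_size(theta, p, **design), 1))
```

For intermediate values of $\theta$ the acceptance probability has to be obtained first from the operating characteristic.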

7.10  Tolerance Limits    Tolerance limits are limits within which we are
confident that at least a specified proportion of the population will lie (with of
course a specified degree of confidence). We may for instance claim with 95%
confidence that at least 90% of a particular population will have values of $X$
between given limits. If these limits are the smallest and the greatest values
observed in a sample of $N$, we may ask how large $N$ should be to justify the claim.
If $x_1$ and $x_N$ are the least and greatest values of $X$ for the sample, and if $f(x)$
is the density function for $X$, the proportion of the population lying between
$x_1$ and $x_N$ is

(7.10.1)  $v = \int_{x_1}^{x_N} f(x)\,dx$

The density function for $v$ is

(7.10.2)  $g(v) = N(N - 1)v^{N-2}(1 - v)$,        $0 < v < 1$

The probability that $v > p$ is therefore

$\int_p^1 g(v)\,dv$

and for a confidence coefficient of $1 - \alpha$ this probability is $1 - \alpha$. Integrating
Eq. (2) we obtain

(7.10.3)  $\alpha = Np^{N-1} - (N - 1)p^N$

from which $N$ can be obtained for given $\alpha$ and $p$. For $\alpha = 0.05$ and $p = 0.99$,
we find $N = 473$. This means that if we take a random sample of size 473 from a
population in which $X$ is distributed continuously, there is a probability 0.95
that at least 99% of the population will have $X$ values between the least and the
greatest values found in the sample. This result is independent of the form of the
distribution.
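
Equation (7.10.3) cannot be solved for $N$ in closed form, but a direct search is easy. The following minimal sketch (the names `alpha_for` and `min_sample_size` are illustrative) returns the smallest $N$ for which the right-hand side of Eq. (7.10.3) does not exceed the chosen $\alpha$; it relies on the fact that this expression decreases as $N$ increases.

```python
def alpha_for(n, p):
    """Right-hand side of Eq. (7.10.3): probability that the sample range
    covers less than a proportion p of the population."""
    return n * p ** (n - 1) - (n - 1) * p ** n

def min_sample_size(alpha, p):
    """Smallest sample size whose extreme values form tolerance limits
    for a proportion p with confidence coefficient 1 - alpha."""
    n = 2
    while alpha_for(n, p) > alpha:
        n += 1
    return n

print(min_sample_size(alpha=0.05, p=0.99))
print(min_sample_size(alpha=0.05, p=0.90))
```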


PROBLEMS 
A.  (§  7.2) 

1.  Households  in  a  town  are  stratified  into  a  high-rent  stratum  (4,000  items)  and 
a  low-rent  stratum  (20,000  items).  The  variate  X,  of  which  the  average  is  to  be  estimated, 
is  thought  to  have  a  standard  deviation  in  the  first  stratum  about  three  times  that  in 
the  second.  How  should  a  total  sample  of  1000  be  divided  between  the  two  strata? 

2.  The  farms  in  a  certain  county  are  stratified  according  to  size  in  seven  strata,  as 
shown  in  the  table  below.  For  the  variate  X  (the  number  of  acres  in  corn)  the  stratum 
means $\mu_i$ and the stratum standard deviations $\sigma_i$ are as given. If it is required to take
a  sample  of  100  farms  for  estimating  some  quantity  closely  related  to  X,  how  should 
these  farms  be  allocated  among  the  strata  (a)  with  proportionate  sampling,  (b)  with 
optimum  sampling?  Compare  the  precision  of  each  of  these  methods  with  that  of 
simple  random  sampling. 




Farm Size (Acres)    No. of Farms    $\mu_i$    $\sigma_i$

0 - 40                    394           5.4       8.3
41 - 80                   461          16.3      13.3
81 - 120                  391          24.3      15.1
121 - 160                 334          34.5      19.8
161 - 200                 169          42.1      24.5
201 - 240                 113          50.1      26.0
241 -                     148          63.8      35.2

Hint: The precision varies inversely as the variance. The variance of the mean of a
simple random sample is $\sigma^2(M - N)/(MN)$, where $\sigma^2$ is the overall variance of $X$.
This can be found from the $\sigma_i$ and $\mu_i$ by the formula:
$(M - 1)\sigma^2 = \sum_i (M_i - 1)\sigma_i^2 + \sum_i M_i(\mu_i - \mu)^2$,  $\mu = \sum_i M_i\mu_i/M$.
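
The pooling formula in the hint can be sketched as follows. This is only an illustration of the arithmetic, with the table entered as (number of farms, stratum mean, stratum standard deviation); the function names are assumptions, and the allocation questions (a) and (b) are left to the reader.

```python
# (M_i, mu_i, sigma_i) for the seven farm-size strata of the table above.
STRATA = [
    (394, 5.4, 8.3), (461, 16.3, 13.3), (391, 24.3, 15.1), (334, 34.5, 19.8),
    (169, 42.1, 24.5), (113, 50.1, 26.0), (148, 63.8, 35.2),
]

def overall_mean_and_variance(strata):
    """Combine stratum sizes, means and standard deviations as in the hint."""
    big_m = sum(m for m, _, _ in strata)
    mu = sum(m * mean for m, mean, _ in strata) / big_m
    within = sum((m - 1) * sd ** 2 for m, _, sd in strata)
    between = sum(m * (mean - mu) ** 2 for m, mean, _ in strata)
    return big_m, mu, (within + between) / (big_m - 1)

def srs_variance_of_mean(big_m, n, sigma2):
    """Variance of the mean of a simple random sample of n from M items."""
    return sigma2 * (big_m - n) / (big_m * n)

big_m, mu, sigma2 = overall_mean_and_variance(STRATA)
print(big_m, round(mu, 2), round(sigma2, 1))
print(round(srs_variance_of_mean(big_m, 100, sigma2), 3))
```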

3.  Prove that if $V_r(\bar{X})$ is the variance for a random sample and $V_p(\bar{X})$ that for a
proportionate sample, then

$V_r(\bar{X}) - V_p(\bar{X}) = \dfrac{M - N}{MN(M - 1)}\left[\sum_i M_i(\mu_i - \mu)^2 - \dfrac{1}{M}\sum_i (M - M_i)\sigma_i^2\right]$


4.  A variate $X$ is distributed in the population with density $e^{-x}$, $x > 0$. The
population is divided into two strata at the point $x_0$ and a stratified sample of size $N$ is
taken with proportionate sampling. Show that the variance of the sample estimator $\bar{X}$
of the population mean is $N^{-1}[1 - (x_0^2 e^{-x_0})/(1 - e^{-x_0})]$, and find for what value of
$x_0$ this is least. Hint: The population is infinite, so that the sampling fraction $f$ is zero.
The ratios $M_1/M$ and $M_2/M$ are given by the integral of $e^{-x}$ from 0 to $x_0$ and from $x_0$
to $\infty$, respectively.

B.  (§  7.3) 

1.  From the following artificial population with three clusters, suppose that two
clusters are selected and two units are selected from each cluster. Find the variance of
the unbiased estimator for $\mu$. If the sampling is proportional to cluster size (one item
from cluster 1, two from 2, three from 3) what is the variance?

Cluster No. ($i$)    $x_{i\alpha}$              $M_i$

1                    0, 1                         2
2                    1, 2, 2, 3                   4
3                    3, 3, 4, 4, 5, 5             6

Note that the clusters are widely dissimilar, so that the first term in Eq. (7.3.3) is much
the larger of the two.

2.  If the sub-sample number for each cluster sampled is proportional to the size
of that cluster, show that the estimator of $\mu$ reduces to $kT/(Mlf)$ where $f$ is the common
sampling fraction $N_j/M_j$ and $T$ is the total of the $x_{jh}$ for all the items in the combined
sample.

3.  An alternative method of selecting the $l$ clusters is to sample with probabilities
proportional to cluster size. That is, the probability $z_i$ of selecting the $i$th cluster is
$M_i/M$. (Note that this means sampling with replacement. The same cluster may appear
more than once in the sample.)

If the sub-sample size is the same ($N$) for each cluster sampled, show that $T/(Nl)$ is
an unbiased estimator of $\mu$, where $T$ is defined as in Problem 2. Hint: Let $T_j$ be the
total for the $j$th cluster. Show that, for a given value of $j$, $E(T_j) = N\mu_j$ and that, over
all $j$, $E(\mu_j) = \mu$.

4.  The variance of the estimator in Problem 3 is

$\sum_i \left[M_i(\mu_i - \mu)^2 + (M_i - N)\sigma_i^2/N\right]/(lM)$

Show that with the data of Problem 1, this estimator has a considerably smaller
variance than either of those in Problem 1.

C.  (§  7.4) 

1.  Show that the variance of the mean of a systematic sample may be written

$V(m_i) = \dfrac{M - 1}{M}\sigma^2 - \dfrac{N - 1}{N}s_w^2$

where

$s_w^2 = \dfrac{\sum_{i,j}(x_{ij} - m_i)^2}{k(N - 1)}$

which is the average of the variances within the separate systematic samples. Hint:
$\sum_{i,j}(x_{ij} - \mu)^2 = \sum_{i,j}(x_{ij} - m_i)^2 + \sum_{i,j}(m_i - \mu)^2$, and this last term is $N\sum_i (m_i - \mu)^2$.

2.  Prove from the result of Problem 1 that $V(m_i) < V(m_r)$ if and only if $s_w^2 > \sigma^2$,
where $m_r$ is the mean of a random sample of size $N$. This result indicates that systematic
sampling is more precise than random sampling when the variance within a sample
tends to be larger than that in the whole population, that is, when the sample is markedly
heterogeneous.

3.  The following table exhibits an artificial population with a fairly steady rising
trend; here $M = 40$, $N = 4$, $k = 10$, and each column is a separate systematic sample.
Calculate the estimator for $\mu$ from each of these ten samples. Find the variance of these
estimators and compare with the average variance within samples and the variance of
the mean of a random sample from the given population.

Systematic Sample Numbers

 1    2    3    4    5    6    7    8    9   10

 0    1    1    2    5    4    7    7    8    6
 6    8    9   10   13   12   15   16   16   17
18   19   20   20   24   23   25   28   29   27
26   30   31   31   33   32   35   37   38   38

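4.  Calculate the serial correlation coefficient $\rho_k$ for a lag of $k$, defined by

$k(N - 1)\sigma^2\rho_k = \sum_{\beta=1}^{k(N-1)} (x_\beta - \mu)(x_{\beta+k} - \mu)$

for the data of Problem 3. Hint:

$\sum_\beta (x_\beta - \mu)(x_{\beta+k} - \mu) = \sum_\beta (x_\beta x_{\beta+k}) - \mu\left(\sum_\beta x_\beta + \sum_\beta x_{\beta+k}\right) + k(N - 1)\mu^2$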

5.  If the serial correlation coefficient $\rho_{jk}$ for a lag of $jk$ is defined by

$k(N - j)\sigma^2\rho_{jk} = \sum_{\beta=1}^{k(N-j)} (x_\beta - \mu)(x_{\beta+jk} - \mu)$

and if $\rho_{jk} = (\rho_k)^j$, where $\rho_k$ is the coefficient for a lag of $k$, show that when terms of
order $1/N^2$ are neglected, the ratio of the variances of the two estimators $m_i$ and $m_r$ is
given approximately, for large $M$, by $V(m_i)/V(m_r) = (1 + \rho_k)/(1 - \rho_k)$.

D.  (§ 7.5)

1.  In a double sampling scheme where the items are stratified in two classes (e.g.,
large and small), the first sample of $N$ produces $N_1$ large and $N_2$ small items. A random
sub-sample of $n_1$ is drawn from the $N_1$ items and an independent random sample of
$n_2$ from the $N_2$ items, and the variate $X$ is measured on these sub-samples. If $T_1$, $T_2$,
are the totals of $X$ for the $n_1$ and $n_2$ items, and $T$ is the total for the whole population,
show that an unbiased estimator of $T$ is $t = (hT_1 + kT_2)/f$ and that its variance is given by

$f\,V(t) = (M - N)\sigma^2 + (h - 1)M_1\sigma_1^2 + (k - 1)M_2\sigma_2^2$

where $h = N_1/n_1$; $k = N_2/n_2$; $f = N/M$.

2.  Show that the expression for the variance in Problem 1 may be written, if we
ignore the differences between $M$ and $M - 1$; $M_1$ and $M_1 - 1$; $M_2$ and $M_2 - 1$,

$V(t) = \dfrac{1-f}{f}\left[M_1(\mu_1 - \mu)^2 + M_2(\mu_2 - \mu)^2\right] + M_1\sigma_1^2\,\dfrac{h-f}{f} + M_2\sigma_2^2\,\dfrac{k-f}{f}$

(See the Hint following Problem A-2).

3.  Suppose  the  farms  in  Problem  A-2  are  divided  into  two  classes,  called  large  (over 
160  acres)  and  small  (160  acres  or  under).  A  first,  comparatively  cheap,  sample  of  200 
is  taken  and  this  gives  40  large  and  160  small  farms.  The  variate  X  is  measured  on  a 
subsample  of  30  of  the  large  farms  and  50  of  the  small  ones.  Calculate  the  variance  of 
T  for  the  double  sample  and  compare  it  with  the  variance  measured  on  a  random 
sample  of  100  from  the  original  population. 


Farm Size      $M_i$     $\mu_i$    $\sigma_i^2$    $M_i\sigma_i^2$

Large            430      51.6        922            396,500
Small           1580      19.4        312            493,000

Population      2010      26.3        617          1,239,300

4.  Find expressions for the optimum values of $N$, $h$ and $k$, using the method of
§ 7.5 and supposing that only $n_1$ of the $N_1$ large units obtained in the first sample are
sub-sampled (with additional cost $C_1$) and that $h = N_1/n_1$. Hint: Differentiate $F$
partially with respect to $N$, $h$ and $k$.

5.  Apply the results of Problem 4 to the data of Problem 3, assuming that $C_0 = 0.1$,
$C_1 = 0.8$, $C = 1.0$, and calculate the values of $h$ and $k$ for optimum sampling. If the
standard deviation of the estimator $t$ is not to exceed 2650 acres, calculate the size of
the primary sample and the expected cost of getting the required information with
double sampling. Calculate also the size and expected cost of a simple random sample
from the population to give the same precision of estimation.

E.  (§§  7.6-7.10) 

1.  Suppose that in a certain population the probability $\theta$ that an individual is
defective is either 0.1 or 0.3, but cannot have any other value. We wish to test the
hypothesis $H_0$ that $\theta = 0.1$ against the alternative $H_1$ that $\theta = 0.3$, on the basis of a
fixed-sample test. The test consists in accepting $H_0$ if the number of defectives $d_N$ in a
sample of size $N$ is less than $k$; otherwise we accept $H_1$. Find $N$ and $k$ if the risks of
error are $\alpha = 0.02$ and $\beta = 0.03$.

2.  Construct a sequential acceptance-and-rejection chart for Problem 1 above.

3.  Calculate the approximate expected number of trials before a decision is reached
by the method of Problem 2. Hint: Use Eq. (7.9.8), first assuming $H_0$ and then
assuming $H_1$.

4.  Perform an imaginary sampling experiment from the population of Problem 1,
as follows: read off a set of one-digit random numbers (say a column from the table
in Appendix B.1), regarding each digit as a sample item. Count each zero as a defective
(this corresponds to hypothesis $H_0$). Continue until a decision is reached, using the
chart constructed in Problem 2, and note the number of trials necessary to reach the
decision. Repeat 20 times, using different sets of random numbers, and note the average
number of trials required. Compare with the result of Problem 3 for $H_0$.

5.  Suppose the proportion of defectives in a population can vary from 0 to 1, but
that acceptance limits are fixed at 0.1 and 0.3. Construct the operating characteristic
(or power curve) of the binomial sequential test, with $\alpha = 0.02$, $\beta = 0.03$.

6.  Construct a sequential acceptance-and-rejection chart, for testing the binomial
probability $\theta = 0.5$ against the alternative $\theta = 0.7$, given the risks of error as $\alpha = 0.1$,
$\beta = 0.2$. If the following table represents the results of a sequence of trials, $x_m$ being
the number of successes in $m$ trials, show graphically that the sampling terminates with
a decision in favour of $\theta = 0.5$ at the 10th trial.


$m$       1    2    3    4    5    6    7    8    9   10

$x_m$     0    0    1    1    2    3    3    4    4    4

7.  Construct a sequential test for the mean $\mu$ of a Poisson distribution, to test
$\mu = \mu_0$ against $\mu = \mu_1$ ($\mu_1 > \mu_0$). Find the expected sample size and the power function
of this test for given $\alpha$ and $\beta$. Hint: The power function for any value of $\mu$ is given by
$P_\mu = (A^h - 1)/(A^h - B^h)$, where $h$ is the non-zero number for which

$\sum_{x=0}^{\infty} \left[p(x, \mu_1)/p(x, \mu_0)\right]^h p(x, \mu) = 1.$

Here $p(x, \mu)$ is the Poisson probability for $x$ successes with parameter $\mu$. Show that $h$
is given by the relation $\mu + (\mu_1 - \mu_0)h = (\mu_1/\mu_0)^h\mu$.

Chapter 8

EXACT  TESTS  ON  SAMPLES  FROM  A 
NORMAL  POPULATION 

8.1  The Assumption of Normality  We have already obtained in Chapter 5
some moments of the distribution of the sample mean and variance in samples of
size $N$ from a finite population, and we have noted the simplification in these
results when the parent population is supposed to be infinite in size and normal
in distribution. Thus the expectation and variance of $k_1$ are given by

(8.1.1)  $E(k_1) = \kappa_1 = \mu$,    $V(k_1) = \dfrac{\sigma^2}{N}$

where $\mu$ and $\sigma$ are the parameters of the normal parent distribution. Also,

(8.1.2)  $E(k_2) = \kappa_2 = \sigma^2$,    $V(k_2) = \dfrac{2\sigma^4}{N-1}$

furthermore, $k_1$ and $k_2$ are uncorrelated, i.e.,

(8.1.3)  $C(k_1, k_2) = 0$

The skewness and kurtosis of the distribution of $k_1$ were shown to be zero,
which suggested that the distribution of $k_1$ might be normal, but the methods of
Chapter 5 were unsuitable for finding exact distributions. Straightforward
methods of finding density functions for statistics often lead to intractable
mathematical expressions, but when the parent population is normal, the
calculations are relatively simple, and a good deal of work has been done using
the basic assumption of normality. Throughout this chapter we assume, unless
otherwise stated, a normal parent population.

The  most  popular  tests  among  practical  statisticians  are  tests  which  depend 
on normality in the parent population, and these tests are often used in situations
where the assumption of normality is decidedly dubious. Fortunately,
however,  the  tests  are  usually  quite  robust,  which  means  that  considerable 
departures  from  normality  will  not  affect  them  very  much.  When  there  is  grave 
doubt  about  the  assumption,  non-parametric  (or  distribution-free)  tests  should 
be  used,  even  at  some  sacrifice  of  power. 

8.2  The Distribution of the Sample Mean  The fact that the distribution of
the sample mean is normal when the population is normal is easily demonstrated.
We saw in § 2.8 that the cumulant generating function for a linear function
$L\,(= \sum_j c_j X_j)$ of independent variates $X_j$ is given by

(8.2.1)  $K_L(h) = \sum_j K_j(c_j h)$

where $K_j(h)$ is the c.g.f. for $X_j$. If all the $X_j$ are normal with mean $\mu$ and variance
$\sigma^2$, and if $L = \bar{x} = \sum_j X_j/N$,

(8.2.2)  $K_L(h) = \sum_{j=1}^{N}\left(\mu\dfrac{h}{N} + \dfrac{\sigma^2 h^2}{2N^2}\right) = \mu h + \dfrac{\sigma^2}{N}\,\dfrac{h^2}{2}$

and this is the c.g.f. for a normal distribution with mean $\mu$ and variance $\sigma^2/N$.
On the assumption that a distribution is uniquely determined by its c.g.f., this
proves the normality of $L$. The distribution of the variance and higher
moments is not, however, so easily obtained.

8.3  The  Distribution  of  the  Sample  Variance  One  way  of  arriving  at  the 
distribution  of  the  variance  is  to  find  the  joint  distribution  of  the  mean  and 
variance,  and  then  integrate  over  all  possible  values  of  the  mean.  For  simplicity 
we  will  choose  the  origin  for  X  in  such  a  way  that  the  population  mean  is  zero. 
This  will  have  no  effect  at  all  on  the  distribution  of  the  variance.  If  the  N 
observed sample values of $X$ are $x_1, x_2, \ldots, x_N$, the joint density function for the
sample is

(8.3.1)  $f(x_1, x_2, \ldots, x_N) = (2\pi\sigma^2)^{-N/2}\,e^{-\sum x_i^2/2\sigma^2}$

Now  $\sum x_i^2 = \sum (x_i - \bar{x} + \bar{x})^2 = \sum (x_i - \bar{x})^2 + N\bar{x}^2 + 2\bar{x}\sum (x_i - \bar{x})$
where $\bar{x}$ is the sample mean. Since $\sum (x_i - \bar{x}) = 0$ and $\sum (x_i - \bar{x})^2 = Nm_2$,
where $m_2$ is the second sample moment about the mean, we have

(8.3.2)  $\sum x_i^2 = N(\bar{x}^2 + m_2)$

Therefore (1) may be written

(8.3.3)  $f(x_1, x_2, \ldots, x_N) = (2\pi\sigma^2)^{-N/2}\,e^{-N(\bar{x}^2 + m_2)/2\sigma^2}$

Since $m_2$ is proportional to the sample variance $k_2$, the relation being

(8.3.4)  $Nm_2 = (N - 1)k_2$

the distribution of $k_2$ is easily obtainable from that of $m_2$.

There  are  two  methods  of  proceeding  in  a  problem  like  this — one  is  the 
analytical  method,  which  involves  a  good  deal  of  algebraic  manipulation ;  the 
other  method  is  geometrical  and  requires  considerable  spatial  intuition.  Fisher's 
original approach was geometric, but various writers since have given analytical
proofs. Both methods are explained in detail in [1].

In the analytical method we change the set of variables $x_1, x_2, \ldots, x_N$ to a
new set, of which two will be $\bar{x}$ and $m_2$, the variables which appear in Eq. (3).
We therefore need $N - 2$ new variables, $w_1, w_2, \ldots, w_{N-2}$, which can be chosen
in the most convenient way and which will later disappear. Then

(8.3.5)  $f(x_1, x_2, \ldots, x_N)\,dx_1\,dx_2 \cdots dx_N = g(w_1, w_2, \ldots, w_{N-2}, \bar{x}, m_2)\,dw_1 \cdots d\bar{x}\,dm_2$

Since, by Eq. (3), $f(x_1, \ldots, x_N)$ is already expressed in terms of the new set,
we merely have to work out the relation between the differentials. This is

(8.3.6)  $dx_1 \cdots dx_N = |J|\,dw_1 \cdots d\bar{x}\,dm_2$

where $J$ is the Jacobian of the old variables with respect to the new (see Appendix
A.4). For a certain particular choice of the $w$'s, we find

(8.3.7)  $J = \tfrac{1}{2}N^{(N-1)/2}\,m_2^{(N-3)/2}\,D$

where $D$ is an expression, in the form of a determinant, depending only on the
$w$'s. It follows that

(8.3.8)  $g(w_1, w_2, \ldots, \bar{x}, m_2) = \dfrac{N^{(N-1)/2}}{2(2\pi\sigma^2)^{N/2}}\,|D|\,m_2^{(N-3)/2}\exp\left[-\dfrac{N}{2\sigma^2}(\bar{x}^2 + m_2)\right]$

If we now integrate over all possible values of the $w$'s (this integration does not
actually have to be carried out), we know that the result must be of the form

(8.3.9)  $h(\bar{x}, m_2) = C\,m_2^{(N-3)/2}\exp\left[-\dfrac{N}{2\sigma^2}(\bar{x}^2 + m_2)\right]$

where $C$ is some constant depending on the bounds of integration of the $w$'s
and on the constant factors in Eq. (8). We do not need, at this stage, to know
exactly what it is. Since this joint distribution is of the form $f_1(\bar{x})$ multiplied by
$f_2(m_2)$, it is clear that $\bar{x}$ and $m_2$ are independent variates. To obtain the
distribution of $m_2$ we simply have to integrate over all possible values of $\bar{x}$ ($-\infty$ to
$+\infty$). This gives


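(8.3.10)  $f(m_2) = \int_{-\infty}^{\infty} h(\bar{x}, m_2)\,d\bar{x}$

          $= C\,m_2^{(N-3)/2}\,e^{-Nm_2/2\sigma^2}\int_{-\infty}^{\infty} e^{-N\bar{x}^2/2\sigma^2}\,d\bar{x}$

          $= C_1\,m_2^{(N-3)/2}\,e^{-Nm_2/2\sigma^2}$

where

(8.3.11)  $C_1 = C\left(\dfrac{2\pi\sigma^2}{N}\right)^{1/2}$

The constant $C_1$ can now be found, since

$\int_0^{\infty} f(m_2)\,dm_2 = 1$

By the substitution $u = Nm_2/2\sigma^2$, this becomes

$C_1\left(\dfrac{2\sigma^2}{N}\right)^{(N-1)/2}\int_0^{\infty} u^{(N-3)/2}e^{-u}\,du = 1$

from which, since the integral is $\Gamma\left(\dfrac{N-1}{2}\right)$,

(8.3.13)  $C_1 = \dfrac{\left(\dfrac{N}{2\sigma^2}\right)^{(N-1)/2}}{\Gamma\left(\dfrac{N-1}{2}\right)}$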

This gives, with Eq. (10), the distribution of $m_2$. That of $k_2$ is found from
Eq. (4), since

$g(k_2)\,dk_2 = f(m_2)\,dm_2 = f(m_2)\,\dfrac{N-1}{N}\,dk_2$

so that

(8.3.14)  $g(k_2) = \dfrac{N-1}{N}\,C_1\left(\dfrac{N-1}{N}k_2\right)^{(N-3)/2}e^{-(N-1)k_2/2\sigma^2}$

          $= \left(\dfrac{N-1}{2\sigma^2}\right)^{(N-1)/2}\dfrac{k_2^{(N-3)/2}\,e^{-(N-1)k_2/2\sigma^2}}{\Gamma\left(\dfrac{N-1}{2}\right)}$

This is more concisely expressed in terms of $n = N - 1$, which is the number
of degrees of freedom in the expression for the variance. With this notation,

(8.3.15)  $g(k_2) = \left(\dfrac{n}{2\sigma^2}\right)^{n/2}\dfrac{1}{\Gamma(n/2)}\,k_2^{(n-2)/2}\,e^{-nk_2/2\sigma^2}$

If we put $nk_2/\sigma^2 = \chi^2$, this becomes the ordinary $\chi^2$ distribution with $n$
degrees of freedom. Thus,

$g(k_2)\,dk_2 = f(\chi^2)\,d\chi^2 = \dfrac{n}{\sigma^2}\,f(\chi^2)\,dk_2$

so that

(8.3.16)  $f(\chi^2) = \left(\tfrac{1}{2}\right)^{n/2}\dfrac{(\chi^2)^{(n-2)/2}\,e^{-\chi^2/2}}{\Gamma(n/2)}$

the same as Eq. (4.6.4).

If we put $u = \dfrac{\chi^2}{2} = \dfrac{nk_2}{2\sigma^2}$, $u$ is a gamma variate with parameter $n/2$.
Its density function is

$f(u) = \dfrac{u^{(n-2)/2}\,e^{-u}}{\Gamma(n/2)}$

and the moments and cumulants can be found from § 4.4. The $r$th cumulant of $u$
is $(n/2)(r - 1)!$, so that the $r$th cumulant of $k_2$ is

(8.3.17)  $\kappa_r = \left(\dfrac{2\sigma^2}{n}\right)^r \dfrac{n}{2}\,(r - 1)! = (r - 1)!\;\sigma^{2r}\left(\dfrac{2}{n}\right)^{r-1}$

*  8.4  The Geometrical Approach to the Joint Distribution of the Mean and
Variance  *  Because one cannot visualize space of more than three dimensions,
we will carry through the discussion for $N = 3$. The argument is similar for
larger samples.

We suppose as before that the population mean is zero. The observed sample
values $x_1, x_2, x_3$ are considered as the coordinates of a point in the sample
space of 3 dimensions. The sample mean $\bar{x}$ is given by

(8.4.1)  $\bar{x} = \tfrac{1}{3}(x_1 + x_2 + x_3)$

For a given value of $\bar{x}$ this equation represents a plane equally inclined to all
three axes.

The sample second moment $m_2$ is given by

(8.4.2)  $m_2 = \tfrac{1}{3}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2\right]$

and for given $\bar{x}$ and $m_2$ this represents a sphere of radius $(3m_2)^{1/2}$ with its
centre at the point $(\bar{x}, \bar{x}, \bar{x})$.

The sphere and plane intersect in a circle of centre $M$, where $OM = \sqrt{3}\,\bar{x}$
and $MP = (3m_2)^{1/2}$ (see Figure 39).

Fig. 39    Element of three-dimensional sample space

If $\bar{x}$ increases slightly, the plane of Eq. (1) moves parallel to itself in the
direction $OM$ a distance $d(OM) = dt = \sqrt{3}\,d\bar{x}$.

If $m_2$ also increases, the circle of centre $M$ enlarges, the radius increasing by
an amount $d(MP) = (\sqrt{3}/2)\,m_2^{-1/2}\,dm_2$, which produces an increase of area

$dA = 2\pi r\,dr = 2\pi(3m_2)^{1/2}(\sqrt{3}/2)\,m_2^{-1/2}\,dm_2 = 3\pi\,dm_2$

The volume of the ring-shaped element so formed is

(8.4.3)  $dV = dA \cdot dt = 3\sqrt{3}\,\pi\,dm_2\,d\bar{x}$

The probability that a sample point will lie in this element of volume is, by
Eq. (8.3.3),

(8.4.4)  $dP = (2\pi\sigma^2)^{-3/2}\exp\left[-3(\bar{x}^2 + m_2)/2\sigma^2\right]dV$

         $= \dfrac{3\sqrt{3}}{2\sqrt{2\pi}}\,\sigma^{-3}\exp\left[-3(\bar{x}^2 + m_2)/2\sigma^2\right]dm_2\,d\bar{x}$

and this is the same as Eq. (8.3.9) when we put $N = 3$. By Eqs. (8.3.11) and
(8.3.13)

$C = \left(\dfrac{N}{2\pi\sigma^2}\right)^{1/2}\left(\dfrac{N}{2\sigma^2}\right)^{(N-1)/2}\Big/\;\Gamma\left(\dfrac{N-1}{2}\right)$

and when $N = 3$ this becomes

$\dfrac{1}{\sqrt{\pi}\,\sigma^3}\left(\dfrac{3}{2}\right)^{3/2} = \dfrac{3\sqrt{3}}{2\sqrt{2\pi}}\,\sigma^{-3}$

as in Eq. (4).

The rest of the argument is just as before. In the $N$-dimensional case, the
hypersphere of $N$ dimensions intersects the hyperplane in a hypersphere of
$N - 1$ dimensions. The radius is $\sqrt{Nm_2}$ and the "area" is $K(Nm_2)^{(N-1)/2}$,
where $K = \pi^{(N-1)/2}/\Gamma\left(\dfrac{N+1}{2}\right)$. The element of "volume" is
$\sqrt{N}\,d\bar{x}\,dA = KN^{N/2}\left(\dfrac{N-1}{2}\right)m_2^{(N-3)/2}\,dm_2\,d\bar{x}$
and this, with Eq. (8.3.3), gives the same result as Eq. (8.3.9).
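
The agreement between the two approaches can be confirmed numerically for several sample sizes. In this sketch (function names are illustrative) `c_analytic` is the constant $C$ obtained from Eqs. (8.3.11) and (8.3.13), and `c_geometric` is the normal-density factor $(2\pi\sigma^2)^{-N/2}$ multiplied by the coefficient $KN^{N/2}(N-1)/2$ of the volume element, with $K = \pi^{(N-1)/2}/\Gamma\left(\frac{N+1}{2}\right)$.

```python
import math

def c_analytic(n_obs, sigma):
    """C from Eqs. (8.3.11) and (8.3.13)."""
    return (math.sqrt(n_obs / (2 * math.pi * sigma ** 2))
            * (n_obs / (2 * sigma ** 2)) ** ((n_obs - 1) / 2)
            / math.gamma((n_obs - 1) / 2))

def c_geometric(n_obs, sigma):
    """Constant implied by the hypersphere argument of this section."""
    k = math.pi ** ((n_obs - 1) / 2) / math.gamma((n_obs + 1) / 2)
    return ((2 * math.pi * sigma ** 2) ** (-n_obs / 2)
            * k * n_obs ** (n_obs / 2) * (n_obs - 1) / 2)

for n_obs in range(2, 11):
    print(n_obs, c_analytic(n_obs, 1.5), c_geometric(n_obs, 1.5))
```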

8.5  The Distribution of Student's t  If $\bar{x}$ is the mean of a sample of $N$ from
a normal distribution with mean $\mu$ and variance $\sigma^2$, then, as we have already
seen, the variate

(8.5.1)  $z = N^{1/2}\,\dfrac{\bar{x} - \mu}{\sigma}$

is a standard normal variate. If, however, $\sigma$ is replaced by its estimator $s = k_2^{1/2}$,
the quantity

(8.5.2)  $t = N^{1/2}\,\dfrac{\bar{x} - \mu}{s}$

is not normally distributed, except asymptotically for large $N$. The distribution
of $t$ (actually of $tN^{-1/2}$) was originally obtained by W. S. Gosset, writing
under the pen name of "Student" [2], and the importance of this distribution in
a variety of practical situations was later emphasized by Fisher.

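The joint density function for $\bar{x}$ and $s$ may be obtained from Eq. (8.3.9) by
noting that $m_2 = (N - 1)s^2/N$, so that

(8.5.3)  $f(\bar{x}, s)\,d\bar{x}\,ds = h(\bar{x}, m_2)\,d\bar{x}\,dm_2$

         $= C\left(\dfrac{N-1}{N}\right)^{(N-3)/2}s^{N-3}\exp\left[-\dfrac{N\bar{x}^2}{2\sigma^2}\right]\exp\left[-\dfrac{(N-1)s^2}{2\sigma^2}\right]\dfrac{2(N-1)}{N}\,s\,d\bar{x}\,ds$

         $= A\,s^{N-2}\exp\left[-\dfrac{N\bar{x}^2}{2\sigma^2}\right]\exp\left[-\dfrac{(N-1)s^2}{2\sigma^2}\right]d\bar{x}\,ds$

where

$A = \dfrac{2\left(\dfrac{N}{2\pi}\right)^{1/2}\left(\dfrac{N-1}{2}\right)^{(N-1)/2}}{\sigma^N\,\Gamma\left(\dfrac{N-1}{2}\right)}$

and where we are assuming as before that $\mu = 0$.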

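If we change the variables from $\bar{x}$ and $s$ to $t$ and $s$, we obtain as the joint
density function for $t$ and $s$

(8.5.4)  $g(t, s) = A\,N^{-1/2}\,s^{N-1}\exp\left[-\dfrac{s^2t^2}{2\sigma^2}\right]\exp\left[-\dfrac{(N-1)s^2}{2\sigma^2}\right]$

since, for fixed $s$, $s\,dt = N^{1/2}\,d\bar{x}$ and since $g(t, s)\,dt\,ds = f(\bar{x}, s)\,d\bar{x}\,ds$.

By integrating over all values of $s$, we obtain the desired density function for
$t$, namely,

(8.5.5)  $f(t) = \int_0^{\infty} g(t, s)\,ds = \left[n^{1/2}\,B\left(\tfrac{1}{2}, \tfrac{n}{2}\right)\right]^{-1}\left(1 + \dfrac{t^2}{n}\right)^{-(n+1)/2}$

(see Appendices A.6 and A.7). Here $n = N - 1$ and is called the number of
degrees of freedom for $t$.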

The important characteristic of $f(t)$ is that it is independent of $\sigma$, and therefore
tables of this function can be used to test hypotheses about the mean of a
population,  irrespective  of  what  the  variance  may  be. 

The graph of $f(t)$ is a symmetrical unimodal curve, tailing off towards zero
at both ends. The tails are higher, and the central peak is higher, than for a
normal curve of the same mean, variance and total area (see Figure 40, drawn
for $n = 4$). As $n$ increases, the curve becomes more and more nearly normal.
To show this we note that

(8.5.6)  $\lim_{n\to\infty}\left(1 + \dfrac{t^2}{n}\right)^{-(n+1)/2} = \lim_{n\to\infty}\left(1 + \dfrac{t^2}{n}\right)^{-1/2}\cdot\lim_{n\to\infty}\left(1 + \dfrac{t^2}{n}\right)^{-n/2} = (1)\cdot(e^{-t^2/2})$

(see Appendix A.1), and by using Stirling's approximation (Appendix A.2) it is
easy to prove that

$\lim_{n\to\infty}\left[n^{1/2}\,B\left(\tfrac{1}{2}, \tfrac{n}{2}\right)\right]^{-1} = (2\pi)^{-1/2}$

The limit of $f(t)$ is therefore $(2\pi)^{-1/2}e^{-t^2/2}$, which is the density function for
a standard normal variate.


Fig.  40    Graphs  of  the  t-distribution  and  normal  distribution 
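
The comparison drawn in Figure 40 and the approach to normality can be illustrated numerically. The sketch below is not from the original text: it assumes the density of Eq. (8.5.5) written with the beta function, evaluated through `math.lgamma`, and compares it first with a normal curve of the same variance ($n/(n-2) = 2$ for $n = 4$) and then with the standard normal curve for a large $n$.

```python
import math

def t_density(t, n):
    """Student's t density with n degrees of freedom, Eq. (8.5.5)."""
    log_beta = math.lgamma(0.5) + math.lgamma(n / 2) - math.lgamma((n + 1) / 2)
    return math.exp(-0.5 * math.log(n) - log_beta
                    - (n + 1) / 2 * math.log1p(t * t / n))

def normal_density(t, variance=1.0):
    """Normal density with mean zero and the given variance."""
    return math.exp(-t * t / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# n = 4: higher peak and higher tails than the normal curve of variance 2,
# but lower in between.
for t in (0.0, 1.5, 5.0):
    print(t, t_density(t, 4), normal_density(t, variance=2.0))

# Large n: the t density is close to the standard normal density.
for t in (0.0, 1.0, 2.0):
    print(t, t_density(t, 10_000), normal_density(t))
```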


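From symmetry, the odd-order moments of $f(t)$ are all zero, but the $(2r)$th moment is

(8.5.7)  $\mu_{2r} = 2n^{-1/2}\left[B\left(\tfrac{1}{2}, \tfrac{n}{2}\right)\right]^{-1}\int_0^\infty t^{2r}\left(1+\frac{t^2}{n}\right)^{-(n+1)/2} dt = \frac{n^r (2r-1)(2r-3)\cdots 1}{(n-2)(n-4)\cdots(n-2r)}$

Thus $\mu_2 = \frac{n}{n-2}$, $\mu_4 = \frac{3n^2}{(n-2)(n-4)}$, so that $\kappa_4/\kappa_2^2 = \mu_4/\mu_2^2 - 3 = 6/(n-4)$. For $n > 4$ this is always positive.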


From Eq. (2), $\frac{t^2}{n} = \frac{N(\bar{x}-\mu)^2}{ns^2} = \frac{[N^{1/2}(\bar{x}-\mu)/\sigma]^2}{ns^2/\sigma^2}$. The numerator of this fraction is the square of a standard normal variate (and therefore a $\chi^2$ variate with one d.f.) and the denominator (by § 8.3) is an independent $\chi^2$ variate with $n$ d.f. The fraction itself is therefore the ratio of two gamma variates with parameters 1/2 and $n/2$, respectively (see § 4.6), and this ratio may be shown to have a beta-prime distribution with parameters 1/2, $n/2$. It is often useful to know that any statistic $t$ has the Student-t distribution if $t^2/n$ is the ratio of two independent variates distributed respectively as $\chi^2$ with 1 and $n$ degrees of freedom.

8.6  Tables  of  t  and  Approximations  to  t    In  Appendix  B.4  there  is  a  table  of 
the  integral 

(8.6.1)  $\alpha = P(t > t_\alpha) = \int_{t_\alpha}^\infty f(t)\,dt$

where $f(t)$ is given by Eq. (8.5.5). This table gives, for all values of $n$ from 1 to 30 and for selected values of the probability $\alpha$, the values of $t_\alpha$ satisfying Eq. (1); that is, it gives those values of $t$ which in a random sample of size $n + 1$ from a normal population will be exceeded with probability $\alpha$.

Tables of $t$ often give instead the values which will be exceeded numerically with probability $\alpha$. This procedure corresponds to the equation

(8.6.2)  $\alpha = P(|t| > t_\alpha) = 1 - \int_{-t_\alpha}^{t_\alpha} f(t)\,dt$

Because of the symmetry of $f(t)$, this probability is just twice that given by Eq. (1). If the two-tailed probability is wanted from the table in the Appendix, the probabilities given at the head of the columns should be doubled.

The variance of $t$ is $n/(n-2)$, so that $t\left(\frac{n-2}{n}\right)^{1/2}$ is a standardized variate and is approximately normal for fairly large $n$ (say $n > 30$). Thus, the approximate value of $t$ from Eq. (1) for $n = 30$ and $\alpha = 0.05$ is given by multiplying the corresponding normal variate (1.645) by $(30/28)^{1/2}$. This gives 1.703, whereas the correct value is 1.697.

A better approximation, given by Hendricks [3], is

(8.6.3)  $z \approx \frac{2t\,\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)(t^2 + 2n)^{1/2}}$

where $z$ is the normal variate giving the same probability as the actual $t$. Thus, if $n = 30$ and $t = 1.697$, the corresponding $z$ would be 1.644, which is quite close to the true value 1.645.


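If $n$ is even, the factor $\Gamma\left(\frac{n+1}{2}\right)/\Gamma\left(\frac{n}{2}\right)$ in Eq. (3) is equivalent to $\frac{\pi^{1/2}(n-1)!}{2^{n-1}[(n/2-1)!]^2}$, and if $n$ is odd it is equivalent to $\frac{2^{n-1}\left[\left(\frac{n-1}{2}\right)!\right]^2}{\pi^{1/2}(n-1)!}$.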



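In either case a good approximation for large $n$ is

(8.6.4)  $\frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)} \approx \left(\frac{n}{2}\right)^{1/2}\left[1 - \frac{1}{4n} + \frac{1}{32n^2} + \cdots\right]$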

An excellent approximation to $t$, given by Cornish and Fisher, is the following:

Let $t_\alpha$ and $z_\alpha$ be defined by

(8.6.5)  $\alpha = \int_{t_\alpha}^\infty f(t)\,dt = \int_{z_\alpha}^\infty \phi(z)\,dz$

where $\phi(z) = (2\pi)^{-1/2} e^{-z^2/2}$. Then

(8.6.6)  $t_\alpha \approx z_\alpha + \frac{z_\alpha^3 + z_\alpha}{4n} + \frac{5z_\alpha^5 + 16z_\alpha^3 + 3z_\alpha}{96n^2}$

For $n = 30$, and $\alpha = 0.05$, $z_\alpha = 1.645$, the second term is 0.0508 and the third is 0.00158, so that $t_\alpha \approx 1.697$, which is correct to four figures.
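The three approximations of this section are easily compared by machine. The Python sketch below is illustrative only: the function names are not from the text, and the form taken for Hendricks' formula, $z \approx 2t\,\Gamma((n+1)/2)/[\Gamma(n/2)(t^2+2n)^{1/2}]$, is an assumption chosen because it reproduces the worked value 1.644.

```python
import math

def scaled_normal(z_alpha: float, n: int) -> float:
    """t_alpha from the standardized variate: z_alpha * sqrt(n / (n - 2))."""
    return z_alpha * math.sqrt(n / (n - 2))

def cornish_fisher(z_alpha: float, n: int) -> float:
    """First three terms of the Cornish-Fisher expansion, Eq. (8.6.6)."""
    z = z_alpha
    second = (z**3 + z) / (4 * n)
    third = (5 * z**5 + 16 * z**3 + 3 * z) / (96 * n**2)
    return z + second + third

def hendricks_z(t: float, n: int) -> float:
    """Normal deviate matching t; ASSUMED form of Eq. (8.6.3)."""
    # Gamma((n+1)/2) / Gamma(n/2) computed through log-gamma
    gamma_ratio = math.exp(math.lgamma((n + 1) / 2) - math.lgamma(n / 2))
    return 2 * t * gamma_ratio / math.sqrt(t * t + 2 * n)

print(round(scaled_normal(1.645, 30), 3))   # compare with 1.703 in the text
print(round(cornish_fisher(1.645, 30), 3))  # compare with 1.697 in the text
print(round(hendricks_z(1.697, 30), 3))     # compare with 1.644 in the text
```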

8.7  Confidence  Limits  for  the  Population  Mean    Equation  (8.6.2)  may  be 
written 


(8.7.1)  $1 - \alpha = 2\int_0^{t_\alpha} f(t)\,dt$

where $t = N^{1/2}(\bar{x} - \mu)/s$. There is a probability $1 - \alpha$ that the observed value of the statistic $t$, for a sample of size $N$ from a normal population with mean $\mu$, will lie within the limits $\pm t_\alpha$ (see Figure 40). This is equivalent to the statement that the $100(1 - \alpha)\%$ confidence limits for $\mu$, corresponding to observed values of $\bar{x}$ and $s$ for the sample, are given by

(8.7.2)  $\mu = \bar{x} \pm s\,t_\alpha N^{-1/2}$.

Example  1  [4]  Electric  meters  are  adjusted  to  work  synchronously  with  a 
standard  meter.  After  adjustment,  a  sample  of  10  meters  was  tested  by  means 
of  precision  instruments.  If  the  standard  meter  is  rated  at  1000,  the  observed 
ratings  for  the  sample  were  as  given  under  x  in  the  following  table.  The  question 
to  be  answered  is  whether  the  meters  tested  can  reasonably  be  regarded  as  a 
random  sample  from  a  normal  population  with  mean  1000,  or  whether  there  is  a 
systematic deviation from this standard. From the data, $\bar{x} = 994$, $s^2 = (744 - 160)/9 = 64.9$, and if $\alpha = 0.05$, the value of $t_\alpha$ for nine degrees of freedom is 2.262. The 95% confidence limits for $\mu$, therefore, are

$994 \pm \frac{2.262}{\sqrt{10}}\sqrt{64.9} = 994 \pm 5.8$

or 988.2 to 999.8. Since these limits do not quite include 1000, there is a barely significant deviation (at the 5% level) from the standard value assumed.

The null hypothesis here is that $\mu = 1000$ and the alternative hypothesis (which we are led to accept) is that $\mu < 1000$. If $\mu$ were greater than 1000, the probability of the observed $t$ value (or less) would be less than 2.5%.

Table 8.1

   x      u = x - 990     u^2
  983         -7           49
 1002         12          144
  998          8           64
  996          6           36
 1002         12          144
  983         -7           49
  994          4           16
  991          1            1
 1005         15          225
  986         -4           16
  Sum         40          744
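The limits of Example 1 can be recomputed directly from the ratings in Table 8.1. The following Python sketch takes the tabulated two-tailed value 2.262 for nine degrees of freedom as given; the function name is illustrative and not part of the text.

```python
import math
import statistics

# Observed ratings of the ten meters (column x of Table 8.1)
ratings = [983, 1002, 998, 996, 1002, 983, 994, 991, 1005, 986]

def mean_confidence_limits(sample, t_alpha):
    """Limits x_bar -/+ s * t_alpha / sqrt(N), as in Eq. (8.7.2).

    t_alpha is the two-tailed tabulated value for N - 1 degrees of freedom.
    """
    n_obs = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # uses the divisor N - 1
    half_width = s * t_alpha / math.sqrt(n_obs)
    return x_bar - half_width, x_bar + half_width

lower, upper = mean_confidence_limits(ratings, t_alpha=2.262)
print(round(lower, 1), round(upper, 1))
```

The upper limit falls just short of 1000, which is the "barely significant" deviation noted in the example.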

8.8 Confidence Limits for the Difference of Means in Two Populations  If we suppose that two independent samples come from two different normal populations with means $\mu_1$ and $\mu_2$ but with a common variance $\sigma^2$, we can form confidence limits for the difference $\mu_1 - \mu_2$. If these limits include zero, there is no significant difference (at the chosen level) between the means.

Suppose the samples are of sizes $N_1$ and $N_2$, with means $\bar{x}_1$, $\bar{x}_2$, and variances $s_1^2$, $s_2^2$, respectively. An unbiased estimate of $\sigma^2$, based on both samples, is

(8.8.1)  $\hat{\sigma}^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2}$

where $n_1 = N_1 - 1$, $n_2 = N_2 - 1$. This follows at once from the fact that $E(s_1^2) = \sigma^2$ and $E(s_2^2) = \sigma^2$, so that $E(\hat{\sigma}^2) = \sigma^2$.

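By hypothesis, $\bar{x}_1$ and $\bar{x}_2$ are both normal with means $\mu_1$ and $\mu_2$ and variances $\sigma^2/N_1$ and $\sigma^2/N_2$, respectively. Therefore, $\bar{x}_1 - \bar{x}_2$ is normal with mean $\mu_1 - \mu_2$ and variance $\sigma^2(1/N_1 + 1/N_2)$. If we substitute for $\sigma^2$ the unbiased estimate of Eq. (1) we obtain the statistic

$[\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)]\left[\hat{\sigma}^2\left(\frac{1}{N_1} + \frac{1}{N_2}\right)\right]^{-1/2}$

which has the Student-t distribution with $n_1 + n_2$ degrees of freedom. The $100(1 - \alpha)\%$ confidence limits for $\mu_1 - \mu_2$ are therefore given by

(8.8.2)  $\mu_1 - \mu_2 = \bar{x}_1 - \bar{x}_2 \pm t_\alpha\left[\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2} \cdot \frac{N_1 + N_2}{N_1 N_2}\right]^{1/2}$
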
Example 2  Two batches of concrete were made with slightly different qualities of sand. From each batch four cylinders were made up and tested for compressive strength (lb/in.²), with the results shown:

Batch No.    Values of x                  Mean
    1        1690, 1580, 1745, 1685       1675
    2        1550, 1445, 1645, 1545       1546

The variances for the two samples are 4750 and 6673, respectively. These values are close enough to justify an assumption that the two batches do not differ in variance (a test for this will be given later, § 8.13). The question is whether the means differ significantly. We have $\bar{x}_1 - \bar{x}_2 = 129$, $n_1 = n_2 = 3$, $N_1 = N_2 = 4$, and the 95% confidence limits for $\mu_1 - \mu_2$ are

$129 \pm 2.447\left[\frac{11423}{4}\right]^{1/2} = 129 \pm 131$, or $-2$ to 260.

These  limits  include  zero;  therefore,  at  the  level  of  significance  chosen,  the 
hypothesis  that  there  is  no  difference  between  the  means  for  the  two  batches 
must  be  accepted.  The  observed  difference  is,  however,  almost  significant  at 
this  level. 
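The arithmetic of Example 2 may be verified with a short Python sketch. It pools the two sample variances with weights $N_1 - 1$ and $N_2 - 1$ and takes the two-tailed 5% value 2.447 for six degrees of freedom from the tables; the function name is an assumption made for illustration.

```python
import math
import statistics

batch1 = [1690, 1580, 1745, 1685]
batch2 = [1550, 1445, 1645, 1545]

def pooled_difference_limits(x1, x2, t_alpha):
    """Confidence limits for mu1 - mu2 assuming a common variance."""
    n1, n2 = len(x1) - 1, len(x2) - 1          # degrees of freedom
    pooled_var = (n1 * statistics.variance(x1)
                  + n2 * statistics.variance(x2)) / (n1 + n2)
    std_err = math.sqrt(pooled_var * (1 / len(x1) + 1 / len(x2)))
    diff = statistics.mean(x1) - statistics.mean(x2)
    return diff - t_alpha * std_err, diff + t_alpha * std_err

lower, upper = pooled_difference_limits(batch1, batch2, t_alpha=2.447)
print(round(lower, 1), round(upper, 1))
```

Because the unrounded difference of means is 128.75 rather than 129, the computed limits differ slightly from the rounded figures in the text, but they still just include zero.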

8.9 Confidence Limits for the Difference of Means in Paired Samples  In some types of experimental work, the two samples which are compared are not
independent  random  samples  but  are  deliberately  paired  in  such  a  way  as  to 
reduce  as  much  as  possible  all  accidental  differences  other  than  those  due  to  the 
particular  effect  which  is  being  investigated.  Thus  in  testing  the  effect  of  a  drug 
on  some  property  of  the  blood,  the  same  group  of  experimental  animals  might 
be  examined  before  and  after  administration  of  the  drug.  This  procedure  renders 
the  experiment  more  precise,  since  it  eliminates  the  random  variability  between 
one  group  of  animals  and  another;  this  variability  might,  or  might  not,  affect 
the  particular  property  under  investigation.  If  the  sample  size  is  N,  the  number  of 
degrees  of  freedom  is  n(=  N  —  1)  instead  of  2n  as  it  would  be  if  two  independent 
random  samples  of  size  N  had  been  used. 

Again,  in  comparing  the  yields  of  two  varieties  of  apple,  one  can  imagine  an 
experiment  in  which  pairs  of  trees  of  the  two  varieties  are  grown  side  by  side,  in  a 
dozen  different  locations.  Differences  of  soil  fertility,  drainage,  chemical 
composition of the soil, etc., will then be almost entirely eliminated from the
comparison  of  yields,  since  each  variety  in  any  one  pair  is  growing  under  almost 
the  same  conditions  as  the  other  variety,  and  the  method  of  paired  samples  would 
be  applicable. 

The method of analysis in such a case is to obtain the $N$ differences between members of a pair, $d_i = x_{1i} - x_{2i}$, where the subscript 1 refers to one sample and the subscript 2 to the other. (Thus 1 might refer to an animal before treatment and 2 to the same animal after treatment.) On the null hypothesis that the treatment has really no effect (in the case of the apple trees, that the varieties do not really differ in mean yield), the expectation of $d_i$ is zero. If $s^2$ is the variance of the differences, i.e.,

(8.9.1)  $ns^2 = \sum d_i^2 - N\bar{d}^2$

the quantity $\bar{d}N^{1/2}s^{-1}$ has the Student-t distribution with $n$ degrees of freedom. On the hypothesis of a true difference $\delta$ between the means,

(8.9.2)  $(\bar{d} - \delta)N^{1/2}s^{-1} = \pm t_\alpha$

where $t_\alpha$ is the value of $t$ which is exceeded numerically with probability $\alpha$. Then Eq. (2) can be written

(8.9.3)  $\delta = \bar{d} \pm s\,t_\alpha N^{-1/2}$

which gives $100(1 - \alpha)\%$ confidence limits for $\delta$.

Example  3  The  following  table  gives  pH  values  for  the  arterial  blood  of 
dogs  (a)  breathing  normally,  (b)  after  a  period  of  breathing  air  containing  5  % 
carbon  dioxide. 

Table 8.2

Dog Number    (a) x_1    (b) x_2    d = x_1 - x_2
     1         7.42       7.26          0.16
     2         7.53       7.30          0.23
     3         7.36       7.26          0.10
     4         7.43       7.39          0.04
     5         7.43       7.38          0.05
     6         7.15       6.69          0.46
     7         7.50       7.32          0.18
     8         7.34       7.26          0.08
     9         7.45       7.23          0.22
    10         7.42       7.06          0.36
    11         7.53       7.34          0.19
    12         7.48       7.28          0.20
    13         7.42       7.29          0.13

The mean value of $d$ is $\bar{d} = 0.1846$, and $s^2 = (0.6149 - 0.4430)/12 = 0.0143$, so that $\bar{d}N^{1/2}s^{-1} = 5.57$. For 12 d.f., the 1% value of $t$ (for a two-tailed test) is 3.055, so that the probability is considerably less than 0.01 that a random sample of 13 animals would exhibit a mean difference as great numerically as that found if there were really no effect of the treatment. The hypothesis that breathing 5% of CO₂ has no effect on the pH value for the blood is decisively rejected.

If one could feel quite confident, before the experiment is performed, that if there is any effect it could be only one way (could result only in a lowered pH value) one would be justified in using a one-tailed test. The 1% value is then 2.681 and the probability of the observed result on the null hypothesis is even lower than before.

In terms of confidence limits, the 99% confidence limits for $\delta$ would be $0.1846 \pm 0.0332(3.055)$, or 0.083 to 0.286. This expresses in another way the fact that there is a highly significant difference produced by the treatment.
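The paired analysis of Example 3 can be reproduced from the two columns of pH values. The Python sketch below forms the differences, the statistic $\bar{d}N^{1/2}s^{-1}$, and the 99% limits, taking the tabulated value 3.055 for 12 d.f. as given; the function name is illustrative only. Working to full precision, the statistic should come out close to the 5.57 quoted, with a possible discrepancy in the last figure owing to rounding in the hand computation.

```python
import math
import statistics

# pH of arterial blood: (a) breathing normally, (b) after breathing 5% CO2
normal = [7.42, 7.53, 7.36, 7.43, 7.43, 7.15, 7.50,
          7.34, 7.45, 7.42, 7.53, 7.48, 7.42]
co2 = [7.26, 7.30, 7.26, 7.39, 7.38, 6.69, 7.32,
       7.26, 7.23, 7.06, 7.34, 7.28, 7.29]

def paired_t(x1, x2):
    """Return (d_bar, s, t) for paired samples, with t = d_bar * sqrt(N) / s."""
    diffs = [a - b for a, b in zip(x1, x2)]
    d_bar = statistics.mean(diffs)
    s = statistics.stdev(diffs)  # divisor N - 1
    t = d_bar * math.sqrt(len(diffs)) / s
    return d_bar, s, t

d_bar, s, t = paired_t(normal, co2)
half_width = 3.055 * s / math.sqrt(len(normal))  # 99% limits, 12 d.f.
print(round(d_bar, 4), round(t, 2))
print(round(d_bar - half_width, 3), round(d_bar + half_width, 3))
```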

* 8.10 The t-Test and Maximum Likelihood  Suppose the null hypothesis $H_0$ is that our sample of $N$ values of $X$ comes from a normal population with mean $\mu_0$ and variance $\sigma^2$, and the alternative hypothesis $H_1$ is that the sample comes from a normal population with mean $\mu_1 (> \mu_0)$ and the same variance $\sigma^2$. For convenience, we can take $\mu_0$ as zero (this simply means subtracting a constant amount $\mu_0$ from all the observed values of $X$).

Under $H_0$, the likelihood for the particular set of observed values $x_1, x_2, \ldots, x_N$ is

(8.10.1)  $L_0 = (2\pi\sigma^2)^{-N/2}\exp\left(-\sum\frac{x_i^2}{2\sigma^2}\right)$

while under $H_1$ it is

(8.10.2)  $L_1 = (2\pi\sigma^2)^{-N/2}\exp\left(-\sum\frac{(x_i - \mu_1)^2}{2\sigma^2}\right)$

The region of rejection $R$ must satisfy the condition

(8.10.3)  $\int_{(R)} L_0\,dx_1 \ldots dx_N = \alpha$

for a fixed value $\alpha$ of the probability of wrongly rejecting $H_0$. However, $R$ can be chosen in many ways. To make the test as powerful as possible we should choose $R$ to maximize the probability of rejecting $H_0$ when $H_1$ is true. That is, the probability

(8.10.4)  $P = \int_{(R)} L_1\,dx_1 \ldots dx_N$

is to be a maximum subject to the condition of (3). This implies that we should maximize $\int_{(R)} (L_1 - \lambda L_0)\,dx_1 \ldots dx_N$ without restriction, $\lambda$ being a Lagrange multiplier (see Appendix A.15). If we include in $R$ all the points for which $L_1 - \lambda L_0 > 0$ and exclude all points for which $L_1 - \lambda L_0 < 0$ we shall make the integral as large as possible. The boundary of $R$ is given by $L_1 - \lambda L_0 = 0$. This equation is equivalent to

$\log L_1 = \log\lambda + \log L_0$

which reduces to

$\sum (x_i - \mu_1)^2 = \sum x_i^2 - c$,  $c = 2\sigma^2\log\lambda$

This is again equivalent to $\bar{x} = c_1$, where $c_1$ is another constant depending on $\sigma$, $\mu_1$, $N$ and $\lambda$. The region of rejection is defined by $\bar{x} > c_1$.


The situation can perhaps be appreciated geometrically for the special case $N = 3$, as in § 8.4. The more general case requires a familiarity with $N$-dimensional geometry. The likelihood $L_0$ is constant on the sphere $\sum x_i^2 =$ constant (with center at the origin). Equation (3) implies that on any such sphere we can pick a region of rejection $R$ equal in area to a fixed fraction $\alpha$ of the surface, and we can agree to reject $H_0$ when the sample point lies in this region. The maximum likelihood condition implies that this region is the "cap" cut off on the sphere by the plane $\bar{x} = c_1$, this plane being perpendicular to a line equally inclined to all three axes. (See Fig. 41, where the shaded area represents the cap cut off the sphere by the plane.) The boundaries of all the caps for different spheres lie on a circular cone, with vertex at the origin.

Fig. 41  Region of rejection of $H_0$ when $N = 3$

The plane $\bar{x} = c_1$ lies at a distance $\sqrt{3}\,c_1$ from the origin. The sphere $\sum x_i^2 = c_2$ is of radius $\sqrt{c_2}$ and the fractional area of the cap is given by

(8.10.5)  $\alpha = \frac{\sqrt{c_2} - \sqrt{3}\,c_1}{2\sqrt{c_2}}$

Now the sample variance for a sample with representative point $(x_1, x_2, x_3)$, lying on the intersection of the sphere and the plane, is

(8.10.6)  $s^2 = \tfrac{1}{2}[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2] = \tfrac{1}{2}(x_1^2 + x_2^2 + x_3^2 - 3\bar{x}^2) = \tfrac{1}{2}(c_2 - 3c_1^2)$

and $t$ for this sample is

(8.10.7)  $t_\alpha = \frac{6^{1/2}c_1}{(c_2 - 3c_1^2)^{1/2}}$

For $N = 3$ (and therefore $n = 2$) the distribution of Student's $t$ reduces to

(8.10.8)  $f(t) = (2\sqrt{2})^{-1}(1 + t^2/2)^{-3/2}$

and the integral of this from $t_\alpha$ to $\infty$ is

(8.10.9)  $\int_{t_\alpha}^\infty f(t)\,dt = \tfrac{1}{2}[1 - t_\alpha(2 + t_\alpha^2)^{-1/2}]$

Substituting for $t_\alpha$ the value given by Eq. (7) we find precisely the $\alpha$ of Eq. (5), so that the one-tailed t-test (namely, reject $H_0$ when $t > t_\alpha$) is the same as the maximum likelihood test described above. It follows from Eqs. (5) and (7) that $t = (1 - 2\alpha)(2\alpha - 2\alpha^2)^{-1/2}$ so that the value of $t$ for such a sample is independent of the value of $\sigma^2$ and of the particular value of $\mu_1$ chosen for hypothesis $H_1$. The test is therefore uniformly most powerful against any $H_1$ with $\mu_1 > 0$.

Similarly, if $\mu_1 < 0$ the test $-t > t_\alpha$ is uniformly most powerful. However, if $\mu_1$ may be greater or less than zero, no uniformly most powerful test exists, except in the special class of unbiased tests. A test is said to be unbiased if its power function for testing the hypothesis that a parameter $\theta$ is equal to $\theta_0$ has a minimum at the value $\theta_0$. The ordinary two-tailed $t$ test ($|t| > t_\alpha$) provides a uniformly most powerful unbiased test of $H_0$ against $H_1$, where $H_1$ is the hypothesis $|\mu_1| > 0$.
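The agreement between the cap area and the tail area of $t$ for $N = 3$ can be confirmed numerically. The Python sketch below assumes the following forms for the equations of this section: cap fraction $\alpha = (\sqrt{c_2} - \sqrt{3}c_1)/(2\sqrt{c_2})$, boundary value $t = \sqrt{6}\,c_1/(c_2 - 3c_1^2)^{1/2}$, and tail area $\tfrac{1}{2}[1 - t(2 + t^2)^{-1/2}]$ for two degrees of freedom. The function names and the trial values of $c_1$, $c_2$ are illustrative.

```python
import math

def cap_fraction(c1: float, c2: float) -> float:
    """Fraction of the sphere sum(x_i^2) = c2 cut off by the plane x_bar = c1."""
    return (math.sqrt(c2) - math.sqrt(3) * c1) / (2 * math.sqrt(c2))

def t_on_boundary(c1: float, c2: float) -> float:
    """Value of t (N = 3) for a sample point on both the sphere and the plane."""
    return math.sqrt(6) * c1 / math.sqrt(c2 - 3 * c1**2)

def t2_upper_tail(t: float) -> float:
    """P(T > t) for Student's t with 2 degrees of freedom."""
    return 0.5 * (1 - t / math.sqrt(2 + t * t))

def t_from_alpha(alpha: float) -> float:
    """t expressed through alpha alone: (1 - 2a) / sqrt(2a - 2a^2)."""
    return (1 - 2 * alpha) / math.sqrt(2 * alpha - 2 * alpha**2)

for c1, c2 in [(0.5, 3.0), (0.3, 2.0), (1.0, 10.0)]:
    alpha = cap_fraction(c1, c2)
    t = t_on_boundary(c1, c2)
    print(round(alpha, 6), round(t2_upper_tail(t), 6), round(t_from_alpha(alpha), 6), round(t, 6))
```

In each row the first two figures (cap fraction and tail area) should coincide, as should the last two (the two expressions for $t$).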

* 8.11 The Power of the t-Test  The power is the value of $P$ given by Eq. (8.10.4) subject to the condition of (8.10.3). As in § 8.5, the probability $L_1\,dx_1 \ldots dx_N$ can be expressed as $f(\bar{x}, s)\,d\bar{x}\,ds$, except that now the population mean is to be taken as $\mu_1$ instead of zero. We have

(8.11.1)  $f(\bar{x}, s) = A\sigma^{-N}s^{n-1}\exp\left(-\frac{ns^2}{2\sigma^2}\right)\exp\left[-\frac{N(\bar{x} - \mu_1)^2}{2\sigma^2}\right]$

where $A = (2N/\pi)^{1/2}(n/2)^{n/2}/\Gamma(n/2)$, with $n = N - 1$ as usual.

If we define $t$ as $t = N^{1/2}\bar{x}/s$, we can find $P$ by integrating Eq. (1) over all $\bar{x} > N^{-1/2}st_\alpha$ and over all $s$ from 0 to $\infty$. For any point in the region so delimited, $t$ will be greater than $t_\alpha$. Therefore,

(8.11.2)  $P = A\sigma^{-N}\int_0^\infty s^{n-1}e^{-ns^2/2\sigma^2}\int_{N^{-1/2}st_\alpha}^\infty \exp\left[-\frac{N(\bar{x} - \mu_1)^2}{2\sigma^2}\right]d\bar{x}\,ds$

On putting $z = N^{1/2}(\bar{x} - \mu_1)/\sigma$ and $\chi^2 = ns^2/\sigma^2$, so that $z$ is a standard normal variate and $\chi^2$ is the ordinary $\chi^2$ variate with $n$ degrees of freedom, Eq. (2) becomes

(8.11.3)  $P = \frac{1}{2\Gamma(n/2)}\int_0^\infty\left(\frac{\chi^2}{2}\right)^{(n-2)/2}e^{-\chi^2/2}\int_y^\infty\phi(z)\,dz\,d\chi^2$

where

$\phi(z) = (2\pi)^{-1/2}e^{-z^2/2}$

and

(8.11.4)  $y = t_\alpha\chi n^{-1/2} - \mu_1 N^{1/2}\sigma^{-1}$

This integral can be evaluated numerically for given values of $n$, $\alpha$, and $\mu_1/\sigma$. The power function depends on $\sigma$ (through the quantity $y$) and therefore some preliminary information about $\sigma$ is necessary if we want to use the power function. We could, for example, calculate the size of sample necessary to detect, with a given probability, a given deviation of $\mu_1/\sigma$ from zero but without the information about $\sigma$ itself we cannot say anything about $\mu_1$.
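A numerical evaluation of the power integral might be sketched in Python as follows. The sketch integrates over $\chi$ (rather than $\chi^2$) by Simpson's rule, with lower limit of the inner integral $y = t_\alpha\chi n^{-1/2} - \mu_1 N^{1/2}/\sigma$; the function names, the step count and the truncation point of the integral are all assumptions made for illustration, and the direct power computation in `chi_pdf` is suitable only for moderate $n$.

```python
import math
from statistics import NormalDist

def chi_pdf(u: float, n: int) -> float:
    """Density of chi (the square root of a chi-square variate), n d.f."""
    return (u ** (n - 1) * math.exp(-u * u / 2)
            / (2 ** (n / 2 - 1) * math.gamma(n / 2)))

def t_test_power(t_alpha: float, n: int, delta: float, steps: int = 2000) -> float:
    """P(t > t_alpha) when N^(1/2) * mu1 / sigma = delta; steps must be even."""
    std_normal = NormalDist()
    upper = math.sqrt(n) + 10.0  # the chi density is negligible beyond this
    h = upper / steps

    def integrand(u: float) -> float:
        y = t_alpha * u / math.sqrt(n) - delta
        return chi_pdf(u, n) * (1.0 - std_normal.cdf(y))

    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3

# With delta = 0 the power should reduce to the significance level itself.
print(round(t_test_power(1.746, 16, 0.0), 4))
print(round(t_test_power(1.746, 16, 2.6), 4))
```

Here 1.746 is the tabulated one-tailed 5% value of $t$ for 16 d.f.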
* 8.12 The Non-Central t-Distribution  If the distribution of $t$ on hypothesis $H_1$ is calculated as in § 8.5 from Eq. (8.11.1), we obtain, after some reduction,

(8.12.1)  $f_1(t) = C\left(1 + \frac{t^2}{n}\right)^{-(n+1)/2}e^{-nk^2/2t^2}\,Hh_n(k)$

where

$C = \frac{n!}{2^{(n-1)/2}\,\Gamma(n/2)\,(\pi n)^{1/2}}$,  $k = -\frac{t\delta}{(n + t^2)^{1/2}}$

and

(8.12.2)  $Hh_n(k) = \int_0^\infty \frac{v^n}{n!}e^{-\frac{1}{2}(v + k)^2}\,dv$

This function is known as Airey's function. The quantity $\delta$, defined by

(8.12.3)  $\delta = N^{1/2}\frac{\mu_1}{\sigma}$

is called the non-centrality parameter. Any statistic of the form

(8.12.4)  $t = (z + \delta)w^{-1/2}$

where $z$ is a standard normal variate and $nw$ is an independent variate distributed as $\chi^2$ with $n$ degrees of freedom, has the non-central t-distribution.

When $\delta = 0$, $k = 0$ and $Hh_n(k) = \frac{2^{(n-1)/2}\,\Gamma\left(\frac{n+1}{2}\right)}{n!}$. The density function then reduces to that for the ordinary Student-t, Eq. (8.5.5).

The power of the $t$ test is the integral of $f_1(t)$ from $t_\alpha$ to $\infty$,

(8.12.3)  $P = \int_{t_\alpha}^\infty f_1(t)\,dt$

where $t_\alpha$ is given by

(8.12.4)  $\alpha = \int_{t_\alpha}^\infty f(t)\,dt$

This gives the same result as Eq. (8.11.3).

Extensive tables of non-central $t$ have recently been provided by Resnikoff and Lieberman [5]. These give the density function $f_1(t)$, the cumulative distribution function $F_1(t)$, and certain percentage points of the distribution. Since $f_1(t)$ depends on two parameters, $n$ and $\delta$, the tables are of triple entry. Values of $n$ go from 2 to 49, 2(1)24(5)49.* The argument used is $t/\sqrt{n}$ instead of $t$, this arrangement being more convenient for tabulation. The parameter $\delta$ is expressed in terms of $z_\alpha$, where $z_\alpha$ is the standard normal variate exceeded with probability $\alpha$ (as in Eq. (8.6.5)), the quantity tabulated being $\delta = N^{1/2}z_\alpha$, for ten selected values of $\alpha$ from 0.001 to 0.25.

* This notation means that $n$ goes by steps of 1 to $n = 24$ and then by steps of 5 to $n = 49$.

A rough approximation to the non-centrality parameter for moderate-sized samples is

(8.12.5)  $\delta \approx t_\alpha - z_P\left(1 + \frac{t_\alpha^2}{2n}\right)^{1/2}$

where $z_P$ is the standard normal variate exceeded with probability $P$. If $\beta$ is the probability of error of the second kind, $P = 1 - \beta$. This approximation is useful in estimating the difference in the mean which can be detected with given probabilities of error.

Example 4  Suppose we are using a sample size $N = 17$, and are willing to allow errors $\alpha = 0.05$, $\beta = 0.2$. That is, we will accept a risk 0.05 of wrongly rejecting the hypothesis that $\mu = 0$ and a risk 0.2 of wrongly accepting it when in fact $\mu = \mu_1$. How large must $\mu_1$ be?

Using the approximation Eq. (5), with $n = 16$,

$t_\alpha = 1.746$,  $z_P = -0.842$

$\delta = 1.746 + 0.842(1.0466) = 2.627$

and therefore $\mu_1/\sigma = \delta/\sqrt{17} = 0.637$. This means that we should have about an 80% chance of detecting a real difference in the mean, from the assumed value zero, equal to about 0.64 times the standard deviation.

Some tables compiled by Neyman and Tokarska [6] are rather more convenient than the larger tables [5] for this particular type of problem. They give for $\alpha = 0.05$ and 0.01 and for $n = 1(1)30$, the values of $\delta$ corresponding to selected values of $\beta$. From these tables, with $\alpha = 0.05$, $\beta = 0.2$, and $n = 16$, we find $\delta = 2.60$, giving $\mu_1/\sigma = 0.631$.
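The computation of Example 4 can be sketched in Python. The approximation used is $\delta \approx t_\alpha - z_P(1 + t_\alpha^2/2n)^{1/2}$, with $z_P$ the normal deviate exceeded with probability $P$; the function name is illustrative, and the one-tailed value $t_\alpha = 1.746$ for 16 d.f. is taken from the tables.

```python
import math
from statistics import NormalDist

def approx_noncentrality(t_alpha: float, n: int, power: float) -> float:
    """Rough non-centrality parameter needed for the stated power."""
    # z_P is the standard normal deviate exceeded with probability `power`
    z_p = NormalDist().inv_cdf(1.0 - power)
    return t_alpha - z_p * math.sqrt(1.0 + t_alpha**2 / (2 * n))

sample_size = 17
delta = approx_noncentrality(t_alpha=1.746, n=sample_size - 1, power=0.8)
detectable_shift = delta / math.sqrt(sample_size)  # mu_1 / sigma
print(round(delta, 3), round(detectable_shift, 3))
```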

* 8.13 Sampling Inspection by Variables  A procedure which depends upon non-central $t$ is that of accepting or rejecting a lot according to the percentage of defectives $p$, where "defective" is defined as meaning that a measured random variable $X$ has a value above some fixed standard $u$. This variate $X$ is supposed to be normal with unknown parameters $\mu$ and $\sigma$. The method is to measure the mean $m$ and the standard deviation $s$ for a sample of $N$ and accept the lot if

(8.13.1)  $m + ks < u$

where $k$ is a constant. This criterion can be written as

$\frac{u - m}{s} > k$

or

(8.13.2)  $\frac{\sqrt{N}(u - \mu)/\sigma - \sqrt{N}(m - \mu)/\sigma}{s/\sigma} > \sqrt{N}\,k$

Now $ns^2/\sigma^2$ is distributed as $\chi^2$ with $n$ degrees of freedom, and $\sqrt{N}(m - \mu)/\sigma$ is a standard normal variate, so that the left-hand side of (2) has a non-central $t$ distribution* with $n (= N - 1)$ d.f. and non-centrality parameter given by

(8.13.3)  $\delta = \frac{\sqrt{N}(u - \mu)}{\sigma} = \sqrt{N}\,z_p$

where $z_p$ is the standard normal variate exceeded with probability $p$. The acceptance criterion is therefore

(8.13.4)  $t > \sqrt{N}\,k$,  $t = \sqrt{N}\,\frac{u - m}{s}$

The power function (the probability of accepting the lot) is

(8.13.5)  $P = \int_{\sqrt{N}k}^\infty f_1(t)\,dt$

and several values can be found by interpolation in the tables of Resnikoff and Lieberman for given $N$ and $k$ and different $p$. Thus if $N = 10$ and $k = 1.72$, we find that when $p = 0.10$ (corresponding to $\delta = 4.053$), $P = 0.224$, and when $p = 0.004$ (corresponding to $\delta = 8.386$), $P = 0.969$.
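The non-centrality parameters quoted for $N = 10$ follow from $\delta = \sqrt{N}\,z_p$, where $z_p$ is the normal deviate exceeded with probability $p$. A minimal Python sketch (the function name is an assumption for illustration):

```python
import math
from statistics import NormalDist

def inspection_noncentrality(sample_size: int, fraction_defective: float) -> float:
    """delta = sqrt(N) * z_p, with z_p exceeded with probability p."""
    z_p = NormalDist().inv_cdf(1.0 - fraction_defective)
    return math.sqrt(sample_size) * z_p

for p in (0.10, 0.004):
    print(p, round(inspection_noncentrality(10, p), 3))
```

The probabilities of acceptance themselves still require tables (or numerical integration) of the non-central $t$ distribution; only $\delta$ is computed here.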

Conversely, if we fix two points on the power curve, we can find $N$ and $k$ and so set up a sampling acceptance plan. Thus suppose we want the values of $P$ corresponding to $p_1 = 0.01$ and $p_2 = 0.15$ to be $1 - \alpha$ (= 0.99) and $\beta$ (= 0.10), respectively. That is, if $p$ is as low as 0.01, we shall be almost certain (probability 0.99) to accept the lot. If $p$ is as high as 0.15, we shall be very likely (probability 0.9) to reject it. The corresponding values of $N$ and $k$ are found by trial and error, using the tables. We want to find two consecutive values of $n$, say $n - 1$ and $n$, such that for $n - 1$ there is a $t'$ for which simultaneously

$\int_{t'}^\infty f_1(t)\,dt < 0.99$,  $\delta = \sqrt{N - 1}\,z_{0.01} = 2.326\sqrt{N - 1}$

and

$\int_{t'}^\infty f_1(t)\,dt > 0.10$,  $\delta = \sqrt{N - 1}\,z_{0.15} = 1.036\sqrt{N - 1}$

while for $n$ there is a $t''$ for which simultaneously

$\int_{t''}^\infty f_1(t)\,dt > 0.99$,  $\delta = \sqrt{N}\,z_{0.01} = 2.326\sqrt{N}$

and

$\int_{t''}^\infty f_1(t)\,dt < 0.10$,  $\delta = \sqrt{N}\,z_{0.15} = 1.036\sqrt{N}$

* The negative of this side has the non-central $t$ distribution with parameter $-\delta$, but this is the same as saying that the side itself is non-central $t$ with parameter $\delta$.

We find that for 16 d.f., with $p = 0.01$ and $t'/4 = 1.55$, $P$ is 0.9896, while for 17 d.f. and the same $p$ and $t''$, $P$ is 0.9942. Also, for 16 d.f. and $t'/4 = 1.55$, with $p = 0.15$, $P = 0.1064$ while for 17 d.f. with the same $p$ and $t''$, $P = 0.079$. We can therefore choose $n = 17$ ($N = 18$). Here it happened that $t'$ and $t''$ are identical. With $n = 17$, we find by interpolation that $t' = 6.43$ corresponds to $P = 0.99$; $k$ is therefore $6.43/\sqrt{18} = 1.516$. The sampling plan is to take a sample of size 18 and accept the lot if $t > 6.43$. This will be a little stricter than desired but near enough for practical purposes.

8.14 Confidence Limits for the Variance of a Population  Since the quantity $ns^2/\sigma^2$ is distributed as $\chi^2$ with $n$ d.f. for samples from a normal parent population, it is easy to construct confidence limits for $\sigma^2$ corresponding to a given sample variance. It is merely necessary to find from a table of $\chi^2$ the values $\chi_1^2$ and $\chi_2^2$ such that

(8.14.1)  $\int_{\chi_1^2}^\infty f(\chi^2)\,d\chi^2 = \int_0^{\chi_2^2} f(\chi^2)\,d\chi^2 = \frac{\alpha}{2}$

Then the lower and upper confidence limits for $\sigma^2$ are given respectively by

(8.14.2)  $\sigma_1^2 = \frac{ns^2}{\chi_1^2}$,  $\sigma_2^2 = \frac{ns^2}{\chi_2^2}$

and the confidence coefficient is $100(1 - \alpha)\%$. It is not, of course, necessary that the two tails should each be $\alpha/2$ in area, provided that the sum is equal to $\alpha$. However, it is usual to take them as equal.

Example 5  The variance of a sample of size 10 is 0.064. What are the 95% confidence limits for $\sigma^2$?

The values of $\chi_1^2$ and $\chi_2^2$ corresponding to $\alpha = 0.05$ are 19.023 and 2.700, for 9 d.f. The confidence limits are therefore $\sigma_1^2 = \frac{0.576}{19.023} = 0.030$ and $\sigma_2^2 = \frac{0.576}{2.700} = 0.213$.
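The limits of Example 5 amount to two divisions once the $\chi^2$ points are read from the table. A minimal Python sketch, in which the function name is illustrative and the tabulated values 19.023 and 2.700 for 9 d.f. are supplied by hand:

```python
def variance_confidence_limits(sample_var, dof, chi2_upper, chi2_lower):
    """Limits n*s^2/chi2_upper and n*s^2/chi2_lower for sigma^2.

    chi2_upper and chi2_lower are the tabulated chi-square values cutting
    off alpha/2 in the upper and lower tails respectively.
    """
    scaled = dof * sample_var  # n * s^2
    return scaled / chi2_upper, scaled / chi2_lower

lower, upper = variance_confidence_limits(0.064, dof=9,
                                          chi2_upper=19.023, chi2_lower=2.700)
print(round(lower, 3), round(upper, 3))
# Limits for sigma are customarily taken as the positive square roots.
print(round(lower ** 0.5, 3), round(upper ** 0.5, 3))
```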

Confidence limits for the standard deviation $\sigma$ should, strictly, be obtained from the distribution of $s$. Nevertheless, it is customary to obtain the limits for $\sigma^2$ and use the positive square roots, in spite of the fact that the square root of $s^2$ is not an unbiased estimator for $\sigma$.

8.15 Distribution of the Variance Ratio  Suppose $s_1^2$ and $s_2^2$ are the observed variances for two samples, of sizes $N_1$ and $N_2$, drawn from normal populations with variances $\sigma_1^2$ and $\sigma_2^2$ respectively. We can test the null hypothesis $H_0$ that $\sigma_1^2 = \sigma_2^2$ by calculating the distribution of the variance ratio

(8.15.1)  $F = s_1^2/s_2^2$

This ratio is generally denoted by $F$ in honour of Sir R. A. Fisher, although Fisher originally used the related statistic

(8.15.2)  $z = \tfrac{1}{2}\log_e(s_1^2/s_2^2) = \tfrac{1}{2}\log_e F$,

which has a more nearly symmetrical distribution than $F$.

On the null hypothesis, with $\sigma_2^2 = \sigma_1^2$,

(8.15.3)  $\frac{n_1 F}{n_2} = \frac{n_1 s_1^2/\sigma_1^2}{n_2 s_2^2/\sigma_2^2} = \frac{\chi_{n_1}^2}{\chi_{n_2}^2}$

both the numerator and denominator being $\chi^2$ variates with $n_1$ and $n_2$ d.f. ($n_1 = N_1 - 1$ and $n_2 = N_2 - 1$). The ratio is therefore a beta-prime variate (see § 4.5) with parameters $\alpha = n_1/2$, $\beta = n_2/2$, and its density function is

(8.15.4)  $f(x) = x^{\alpha - 1}(1 + x)^{-\alpha - \beta}/B(\alpha, \beta)$

with $x = n_1 F/n_2$.

The density function for $F$ is given by

$g(F)\,dF = g(F)\,\frac{n_2}{n_1}\,dx = f(x)\,dx$

so that

(8.15.5)  $g(F) = \frac{(n_1/n_2)^{n_1/2}\,F^{(n_1/2) - 1}}{B\left(\frac{n_1}{2}, \frac{n_2}{2}\right)\left(1 + \frac{n_1 F}{n_2}\right)^{(n_1 + n_2)/2}}$,  $0 < F < \infty$

The numbers $n_1$ and $n_2$ are called the degrees of freedom for $F$. This is a positively skew distribution. The mode (the value of $F$ corresponding to a maximum of $g(F)$) is at $F = \frac{n_1 - 2}{n_1} \cdot \frac{n_2}{n_2 + 2}$ which is always less than 1. The expectation of $F$ is given by

(8.15.6)  $E(F) = n_2/(n_2 - 2)$,  $n_2 > 2$

which is independent of $n_1$ and is always greater than 1.
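The statements about the mode and the expectation can be checked against the density itself. The Python sketch below assumes the usual form of the $F$ density, $g(F) = (n_1/n_2)^{n_1/2}F^{n_1/2 - 1}/[B(n_1/2, n_2/2)(1 + n_1F/n_2)^{(n_1+n_2)/2}]$, and the mode $\frac{n_1 - 2}{n_1}\cdot\frac{n_2}{n_2 + 2}$; function names are illustrative.

```python
import math

def f_pdf(f: float, n1: int, n2: int) -> float:
    """Density g(F) of the variance ratio with n1 and n2 degrees of freedom."""
    if f <= 0:
        return 0.0
    log_beta = (math.lgamma(n1 / 2) + math.lgamma(n2 / 2)
                - math.lgamma((n1 + n2) / 2))
    log_g = (n1 / 2 * math.log(n1 / n2) + (n1 / 2 - 1) * math.log(f)
             - log_beta - (n1 + n2) / 2 * math.log1p(n1 * f / n2))
    return math.exp(log_g)

def f_mode(n1: int, n2: int) -> float:
    """Location of the maximum of g(F); meaningful for n1 > 2."""
    return (n1 - 2) / n1 * n2 / (n2 + 2)

def f_mean(n2: int) -> float:
    """Expectation of F; requires n2 > 2."""
    return n2 / (n2 - 2)

mode = f_mode(7, 11)
print(round(mode, 4), round(f_mean(11), 4))
# The density should be lower on either side of the mode.
print(f_pdf(mode - 0.05, 7, 11) < f_pdf(mode, 7, 11) > f_pdf(mode + 0.05, 7, 11))
```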

The distribution of Fisher's $z$ is found by writing $F = e^{2z}$, $dF = 2F\,dz$, and its density function is

(8.15.7)  $f(z) = \frac{2n_1^{n_1/2}n_2^{n_2/2}e^{n_1 z}}{B\left(\frac{n_1}{2}, \frac{n_2}{2}\right)(n_1 e^{2z} + n_2)^{(n_1 + n_2)/2}}$

8.16  Tables  of  the  Distributions  of  F  and  z  Table  B.5  in  the  Appendix  gives 
for  various  values  of  nx  and  n2  the  upper  5%  and  the  upper  1  %  points  of  the 
distribution  of  F.   A  complete  table  of  the  probability  integral  of  F  would  be 
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quite  bulky,  being  a  triple-entry  table,  but  since  we  usually  merely  want  to  know 
whether  an  observed  F  value  is  significant  or  not,  it  is  sufficient  to  have,  for  each 
pair  of  values  of  nl  and  n2,  a  few  values  of  F  corresponding  to  common  levels  of 
significance.  In  the  tables  of  Fisher  and  Yates  [7],  20%,  10%,  5%,  1%  and 
0.1  %  points  are  given  for  both  F  and  z. 

Interpolation for $n_1$ and $n_2$, when necessary, should be harmonic instead of
linear. Thus, suppose the 1% point is required for $n_1 = 60$ and $n_2 = 55$. The
table gives the 1% points for 50 and 55 and for 75 and 55 as 1.90 and 1.82,
respectively. If x is the value for 60 and 55, harmonic interpolation gives

$\dfrac{1.90 - x}{1.90 - 1.82} = \dfrac{1/50 - 1/60}{1/50 - 1/75}$

so that x = 1.86.
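Harmonic interpolation is linear interpolation in the reciprocal of the degrees of freedom. A minimal sketch in Python follows; the function name and argument order are illustrative assumptions, not part of the text.

```python
def harmonic_interpolate(n, n_lo, f_lo, n_hi, f_hi):
    """Interpolate a tabulated percentage point linearly in 1/n.

    n_lo, n_hi are the tabulated degrees of freedom on either side of n,
    and f_lo, f_hi the tabulated percentage points at those entries.
    """
    weight = (1.0 / n_lo - 1.0 / n) / (1.0 / n_lo - 1.0 / n_hi)
    return f_lo - weight * (f_lo - f_hi)


# The case worked in the text: 1% points 1.90 (n1 = 50) and 1.82 (n1 = 75).
x = harmonic_interpolate(60, 50, 1.90, 75, 1.82)
print(round(x, 2))
```

Because 1/60 lies exactly halfway between 1/50 and 1/75, the interpolated value falls midway between the two tabulated points.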


Example 6  For two samples, of sizes 8 and 12, the observed variances are
0.064 and 0.024, respectively. Since the table refers only to the upper points of
the distribution, we will take the subscript 1 to refer to the sample with the larger
variance. Then $n_1 = 7$, $n_2 = 11$, and $F = 0.064/0.024 = 2.67$, with 7 and 11 degrees
of freedom.

The 5% point is 3.01 and the 1% point is 4.88. The probability of a value of
F at least as great as 2.67 is therefore more than 0.05, and the two samples are
not significantly different at this level. From the Fisher and Yates tables we note
that the 10% point is 2.34, so that the difference is significant at the 10% level.


Fig. 42  The F-distribution


A departure from the null hypothesis of equality of variances could just as
well give a value of F less than 1 as a value greater than 1. Corresponding to any
$F_2$ such that $\int_{F_2}^{\infty} g(F)\,dF = \alpha$ there is a value of $F_1$ such that
$\int_0^{F_1} g(F)\,dF = \alpha$ (see Figure 42, where it is assumed that $\alpha < 0.5$).
If we put u = 1/F in the equation

(8.16.1)  $\alpha = \int_{F_2}^{\infty} g(F)\,dF$

we obtain

(8.16.2)  $\alpha = \int_0^{1/F_2} \dfrac{(n_2/n_1)^{n_2/2}\,u^{(n_2/2)-1}\,du}{B(n_2/2,\,n_1/2)\,(1 + n_2u/n_1)^{(n_1+n_2)/2}}$

The integrand here is the same as g(u) in Eq. (8.15.5) but with $n_1$ and $n_2$
interchanged, and $1/F_2$ is the same as $F_1$. Therefore to find the lower 5% point, say,
for a given $n_1$ and $n_2$ we take the reciprocal of the upper 5% point, after
interchanging $n_1$ and $n_2$. This makes it unnecessary to have tables for both ends of the
distribution. It should be noted, however, that when we use a two-tailed test
(supposing that the departures from equality of variance may be in either
direction) the probabilities given in the table must be doubled. The 5% point
becomes a 10% point, for example.

If we write $F = \dfrac{n_2}{n_1}\cdot\dfrac{1 - x}{x}$, or, equivalently,

(8.16.3)  $x = n_2(n_2 + n_1F)^{-1}$

it is a straightforward matter to show that x is a beta-variate with parameters
$n_2/2$, $n_1/2$. The distribution function of x is therefore an incomplete beta function
and the tables of this function can be used to calculate the probability that x is less
than some observed value. Thus in Example 6, we should have
$x = 11/[11 + 7(2.67)] = 0.371$. The probability of a value not greater than this is
$I_x(11/2,\,7/2) = 0.071$, which, as previously noted, is greater than 0.05 but less than 0.1.
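The beta transformation gives a direct way to compute the tail probability of F without tables. The sketch below is an illustration only: it evaluates the incomplete beta function by Simpson's rule using the standard library, and the helper names and step count are assumptions chosen for this example. Simpson's rule is adequate here because both beta parameters exceed 1, so the integrand is smooth.

```python
import math


def regularized_incomplete_beta(x, a, b, steps=2000):
    """I_x(a, b) by Simpson's rule on [0, x]; assumes a > 1 and b > 1."""
    def integrand(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    h = x / steps  # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    complete_beta = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return total * h / 3 / complete_beta


def f_upper_tail(F, n1, n2):
    """Pr(F' >= F) with n1, n2 d.f., using x = n2 / (n2 + n1 F)."""
    x = n2 / (n2 + n1 * F)
    return regularized_incomplete_beta(x, n2 / 2, n1 / 2)


# Example 6: variances 0.064 and 0.024 with 7 and 11 d.f.
F = 0.064 / 0.024
x = 11 / (11 + 7 * F)
p = f_upper_tail(F, 7, 11)
print(round(x, 3), p)
```

The printed probability should fall between 0.05 and 0.10, in agreement with the tabulated 5% and 10% points quoted in the example.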

* 8.17  The Power of the F-Test  As in § 8.15, we assume that the null
hypothesis $H_0$ is that $\sigma_1^2 = \sigma_2^2$ (both populations being normal). Let the
alternative hypothesis $H_1$ be that $\sigma_1^2/\sigma_2^2 = \lambda$, which we may take greater than
1. Then $H_0$ is rejected if $F > F_2$ where

(8.17.1)  $\int_{F_2}^{\infty} g(F)\,dF = \alpha$

The power of this test is the probability that $F > F_2$, under $H_1$,

(8.17.2)  $P = \Pr(F > F_2 \mid H_1)$

Now the statistic $F\sigma_2^2/\sigma_1^2 = \dfrac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}$ on hypothesis $H_1$ has the F distribution
with $n_1$ and $n_2$ d.f. This follows because $(n_1/n_2)\cdot(F\sigma_2^2/\sigma_1^2)$ is the ratio of two
$\chi^2$ variates with $n_1$ and $n_2$ d.f., respectively. If, therefore, $F_P$ is a value of F which
is exceeded with probability P, under $H_0$,

(8.17.3)  $P = \Pr(F\sigma_2^2/\sigma_1^2 > F_P \mid H_1)$

From Eqs. (2) and (3), $F_2 = \sigma_1^2F_P/\sigma_2^2 = \lambda F_P$.

We can use tables of F (preferably those by Merrington and Thompson, see
[7]) to calculate $\lambda$ for selected values of P. For example if $N_1 = N_2 = 10$,
$\alpha = 0.05$ and P = 0.5 (so that 1 - P = 0.5 also), we find $F_2 = 3.18$ and $F_P = 1.00$, so that
$\lambda = 3.18$. This means that we have an even chance of recognizing a difference
between the variances when one is 3.18 times the other if we agree to accept a 5%
chance of rejecting the null hypothesis when the variances are really equal.
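The variance ratio that can be detected with a given power is simply the quotient of two percentage points of F. The sketch below reproduces the calculation without tables; it is an illustration under stated assumptions: the percentage points are found by bisection on a Simpson's-rule tail probability, the function names are hypothetical, and the numerical integration assumes both degrees of freedom exceed 2.

```python
import math


def f_upper_tail(F, n1, n2, steps=1000):
    """Pr(F' > F) for n1, n2 d.f. via I_x(n2/2, n1/2), x = n2/(n2 + n1 F)."""
    a, b = n2 / 2, n1 / 2
    x = n2 / (n2 + n1 * F)
    h = x / steps
    g = lambda t: t ** (a - 1) * (1 - t) ** (b - 1)
    total = g(0.0) + g(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(i * h)
    complete_beta = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return total * h / 3 / complete_beta


def f_upper_point(prob, n1, n2):
    """Value of F exceeded with probability `prob`, found by bisection."""
    lo, hi = 1e-6, 1e3
    for _ in range(60):
        mid = (lo + hi) / 2
        if f_upper_tail(mid, n1, n2) > prob:
            lo = mid  # tail still too large, so the point lies higher
        else:
            hi = mid
    return (lo + hi) / 2


def detectable_ratio(alpha, power, n1, n2):
    """lambda = F_2 / F_P, the variance ratio detected with the given power."""
    return f_upper_point(alpha, n1, n2) / f_upper_point(power, n1, n2)


lam = detectable_ratio(0.05, 0.5, 9, 9)   # samples of 10, so 9 d.f. each
print(round(lam, 2))
```

Calling `detectable_ratio(0.05, 0.5, 24, 24)` gives a value close to 2, which is consistent with the statement that samples of size 25 give an even chance of detecting a halving of the variance.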

The  tables  can  also  be  used  to  estimate  the  size  of  sample  necessary  to  have  a 
given  chance  of  observing  a  stated  difference  in  variance.  Thus  suppose  that  a 
suggested  new  process  of  manufacturing  some  metal  part  might  be  expected  to 
reduce  the  standard  deviation  of  tensile  strength  by  a  factor  of  1.41  (which 
means  halving  the  variance).  We  would  need,  in  order  to  have  an  even  chance 
of  detecting  such  an  effect,  samples  of  size  25,  and  to  have  a  95  %  chance  we 
would  need  samples  of  nearly  100,  the  rejection  error  remaining  at  5%. 

* 8.18  The Variance of Sample Skewness and Kurtosis  It is possible by
means of long and rather tedious algebra to work out the moments of $k_3$, $k_4$, etc.
in samples from a population with known cumulants. Even for a normal parent
population the expressions are long, for any moments above the second. Here
we shall simply state a few results for the normal case.

For the third k-statistic,

(8.18.1)  $E(k_3) = 0, \qquad V(k_3) = \dfrac{6\kappa_2^3\,N}{(N - 1)(N - 2)}$

and an unbiased estimator of this variance is

(8.18.2)  $\dfrac{6k_2^3\,N(N - 1)}{(N - 2)(N + 1)(N + 3)}$

For the fourth k-statistic, $E(k_4) = 0$, and an unbiased estimator of $V(k_4)$ is

(8.18.3)  $\dfrac{24k_2^4\,N(N - 1)^2}{(N - 3)(N - 2)(N + 3)(N + 5)}$

The variances of $g_1 = k_3/k_2^{3/2}$ and of $g_2 = k_4/k_2^2$ (the sample skewness and
kurtosis respectively) were worked out by R. A. Fisher, who found that

(8.18.4)  $V(g_1) = \dfrac{6N(N - 1)}{(N - 2)(N + 1)(N + 3)}, \qquad V(g_2) = \dfrac{24N(N - 1)^2}{(N - 3)(N - 2)(N + 3)(N + 5)}$

For large N these approximate to the values 6/N and 24/N, respectively.
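To see how quickly the exact variances approach 6/N and 24/N, the two expressions of Eq. (8.18.4) can be tabulated directly. The function names below are illustrative only.

```python
def var_g1(N):
    """Exact variance of the sample skewness g1 for a normal parent."""
    return 6 * N * (N - 1) / ((N - 2) * (N + 1) * (N + 3))


def var_g2(N):
    """Exact variance of the sample kurtosis g2 for a normal parent."""
    return 24 * N * (N - 1) ** 2 / ((N - 3) * (N - 2) * (N + 3) * (N + 5))


for N in (25, 100, 1000):
    # Columns: N, exact V(g1), 6/N, exact V(g2), 24/N
    print(N, round(var_g1(N), 5), round(6 / N, 5),
          round(var_g2(N), 5), round(24 / N, 5))
```

For samples of moderate size the large-sample values overstate the exact variances somewhat, the discrepancy being larger for the kurtosis.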

8.19  The Distribution of Extreme Values  It is sometimes convenient to
judge a sample by the largest (or smallest) item in it. If the observed values of X
for the sample are arranged in ascending order,

$x_1 < x_2 < \ldots < x_N$

and if F(x) is the distribution function for the parent population, the probability
that $x_N < x$ is $[F(x)]^N$. This is true because if $x_N < x$, the same inequality must
hold for all the other values in the sample, which are all supposed to be independent.
The probability that the largest item in a sample of size N from a standard
normal population is less than x is given by

(8.19.1)  $P(x_N < x) = [\Phi(x)]^N$

where

$\Phi(x) = (2\pi)^{-1/2} \int_{-\infty}^{x} e^{-u^2/2}\,du$

Values of the lower and upper percentage points of the largest value $x_N$ have
been calculated by Tippett and Pearson [8]. The same table applies to the
smallest value $x_1$ with a change of sign and a reversal of the terms "upper" and
"lower." A brief extract from this table is appended.

Table 8.3

Sample Size     Upper Percentage Points
     N              5%          1%
     5            2.319       2.877
    10            2.568       3.089
    15            2.705       3.207
    20            2.799       3.289
    30            2.929       3.402
    50            3.082       3.539
   100            3.283       3.718
  1000            3.884       4.264

Such  a  table  is  useful  in  some  types  of  quality  control  problems.  If,  for  example, 
a  manufacturer  is  producing  a  certain  article  for  which  the  average  breaking 
strength  should  be  180  lb  with  a  standard  deviation  of  not  more  than  12  lb,  and 
if  routine  samples  of  size  10  are  tested,  the  lowest  value  in  a  sample  should  not  be 
below  180  -  12(3.089)  =  142.9  lb  more  than  once  in  100  times.  If  such  a  low 
value  is  observed  it  might  be  worth  while  to  look  into  possible  causes. 
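Entries of this kind follow directly from Eq. (8.19.1): the upper percentage point x satisfies $[\Phi(x)]^N = 1 - \text{tail}$. A short sketch using Python's standard library (the function name is an assumption made for illustration):

```python
from statistics import NormalDist


def upper_point_of_largest(N, tail):
    """x such that Pr(largest of N standard normal values > x) = tail."""
    # [Phi(x)]^N = 1 - tail, so Phi(x) = (1 - tail)^(1/N).
    return NormalDist().inv_cdf((1 - tail) ** (1 / N))


p5 = upper_point_of_largest(10, 0.05)
p1 = upper_point_of_largest(10, 0.01)
print(round(p5, 3), round(p1, 3))

# Quality-control limit from the text: mean 180 lb, s.d. 12 lb, samples of 10.
lower_limit = 180 - 12 * p1
print(round(lower_limit, 1))
```

By symmetry the same points, with the sign changed, apply to the smallest value, which is how the lower limit in the breaking-strength illustration is obtained.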
* 8.20  The Rejection of Extreme Observations  The question of whether to
reject  an  extreme  value  (often  called  a  "straggler")  in  a  set  of  observations  is  one 
that  sometimes  poses  a  difficulty  in  experimental  work.  If  we  can  assume  that  the 
sample  comes  from  a  normal  population  of  known  mean  and  variance,  the 
distribution  of  the  extreme  value  as  given  in  §  8.19  will  enable  us  to  calculate 
the  risk  we  run  in  rejecting  the  straggler.  In  practice,  however,  the  mean 
and  variance  are  not  generally  known  and  we  must  substitute  estimates  derived 
from  the  sample  itself,  but  the  distribution  is  then  not  precisely  that  of  §  8.19. 

For a sample of size N with mean $\bar{x}$ and standard deviation s, the distribution
of

(8.20.1)  $u_i = (x_i - \bar{x})/s$

where $x_i$ is the value of X for the straggler suspected, was worked out by W. R.
Thompson [9]. He found that the quantity

(8.20.2)  $t = u_i \left[ \dfrac{N - 2}{(N - 1)^2/N - u_i^2} \right]^{1/2}$

has the Student-t distribution with $N - 2$ d.f., so that the probability of such a
value arising by chance in a normal parent population can be calculated. This,
however, refers to a single observation and not to the smallest or largest in a
sample, and care should therefore be used in interpretation. In a sample of
20 from the same normal population we could expect that one, by pure chance,
would reach a t-value corresponding to probability 0.05. As a rough rule-of-thumb,
one might agree to require a probability of less than 0.01 for a sample
of size less than 10 and a probability of less than 0.005 for one of size 10 to 20,
before rejecting the extreme value.

W. J. Dixon [10] has suggested the use of a simple ratio criterion for the
rejection of $x_N$, and this requires very little computation. For samples of size
8 to 12 we compute $r_{11} = (x_N - x_{N-1})/(x_N - x_2)$, and for larger samples
$r_{22} = (x_N - x_{N-2})/(x_N - x_3)$. If the ratio exceeds a critical value $R_\alpha$, the
probability is less than $\alpha$ that the extreme value $x_N$ comes from the same normal
population as the rest of the observations. When the lowest value in the set is the
one suspected, the observations should be placed in reverse order so that $x_N$ is
still the straggler. Table 8.4 is a brief extract from the tables [11] giving
values of $R_\alpha$ for $\alpha = 0.05$ and 0.01, and N from 8 to 16. If $r_{11}$ (or $r_{22}$) is greater
than $R_{0.05}$, the observation $x_N$ may reasonably be rejected. This table applies
when we do not know before seeing the data whether we shall want to test the
highest or the lowest value, and so corresponds to $\alpha = 0.025$ and 0.005 in
Dixon's tables.

Table 8.4

Ratio                     r_11                              r_22
N            8      9     10     11     12       13     14     15     16
R_0.05     .608   .564   .530   .502   .479     .611   .586   .565   .546
R_0.01     .717   .672   .635   .605   .579     .697   .670   .647   .627

Example 7  The following 15 observations were made of the vertical semi-diameter
of the planet Venus in seconds of arc (43.00" have been subtracted from
each reading and the readings have been rearranged in ascending order):

-1.40, -0.44, -0.30, -0.24, -0.22, -0.13, -0.05,
0.06, 0.10, 0.18, 0.20, 0.39, 0.48, 0.63, 1.01

The observation -1.40 is rather suspiciously low. The mean of all the readings
is 0.018 and the standard deviation is 0.551 so that, from Table 8.3, the lowest
value in the sample should not be below 0.018 - 2.705(0.551) = -1.47 more
than once in 20 times, if the sample mean and variance apply to the population.
The observed -1.40 is therefore not too unreasonable, on this supposition.
According to Thompson's criterion, $u_i = -2.574$ and $t = -3.66$, with 13 d.f.,
and the probability of a single value as low as this is less than 0.005. We
might, on this criterion, reasonably reject the straggler. If we do so, and then
test the remaining 14 observations, the largest has a t of 2.874 with 12 d.f. Since
the probability is between 0.05 and 0.01, we should be chary of rejecting it.

Applying Dixon's criterion, we get $r_{22} = 0.585$, which is greater than the
value 0.565 in the table, and so would lead to rejection. After rejecting the
lowest value, the ratio for the highest remaining value is 0.424, and this suggests
retention of the straggler.
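The arithmetic of this example can be checked with a few lines of Python. The sketch below is illustrative: the function names are assumptions, the standard deviation uses the divisor N - 1 (which reproduces the quoted 0.551), and Thompson's statistic follows Eq. (8.20.2). Recomputing from the listed readings gives a t close to -3.66 for the lowest value and 2.874 for the highest value of the remaining fourteen.

```python
import math

readings = [-1.40, -0.44, -0.30, -0.24, -0.22, -0.13, -0.05,
            0.06, 0.10, 0.18, 0.20, 0.39, 0.48, 0.63, 1.01]


def thompson_t(values, suspect):
    """Return (mean, s, u, t) for the suspected value, per Eq. (8.20.2)."""
    N = len(values)
    mean = sum(values) / N
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (N - 1))
    u = (suspect - mean) / s
    t = u * math.sqrt((N - 2) / ((N - 1) ** 2 / N - u * u))
    return mean, s, u, t


def dixon_r22(values, test_lowest=False):
    """Dixon's ratio r22; the suspected value is placed last before computing."""
    xs = sorted(values, reverse=test_lowest)
    return (xs[-1] - xs[-3]) / (xs[-1] - xs[2])


mean, s, u, t_low = thompson_t(readings, readings[0])
r_low = dixon_r22(readings, test_lowest=True)

remaining = readings[1:]          # after rejecting -1.40
_, _, _, t_high = thompson_t(remaining, remaining[-1])
r_high = dixon_r22(remaining)

print(round(mean, 3), round(s, 3), round(u, 3), round(t_low, 2))
print(round(r_low, 3), round(t_high, 3), round(r_high, 3))
```

Both ratios are compared with the $r_{22}$ column of Table 8.4 at the appropriate N (15 for the full set, 14 after one rejection).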

8.21  The Distribution of the Range  The range of a sample, with the observed
values placed in ascending order of size, is given by

(8.21.1)  $R = x_N - x_1$

If F(x) is the distribution function for the parent population, the probability
that $N - 2$ values lie between $x_1$ and $x_N$ is $[F(x_N) - F(x_1)]^{N-2}$. The probability
that one specified observation has the value $x_1$ (to $x_1 + dx_1$) is $f(x_1)\,dx_1$, and
similarly the probability for one specified observation to be equal to $x_N$ is
$f(x_N)\,dx_N$. Since, however, there are $N(N - 1)$ ways in which these two extreme
observations may appear in the original order of the observations ($x_1$ could be
in any one of N places and $x_N$ in any of the remaining $N - 1$ places), the
probability of a sample with lowest value $x_1$ and highest value $x_N$ is

(8.21.2)  $P(x_1, x_N) = N(N - 1)[F(x_N) - F(x_1)]^{N-2} f(x_1) f(x_N)\,dx_1\,dx_N$

Putting $x_N = x_1 + R$, we find as the joint probability density for $x_1$ and R

(8.21.3)  $f(x_1, R) = N(N - 1)[F(x_1 + R) - F(x_1)]^{N-2} f(x_1) f(x_1 + R)$
Integrating over all values of $x_1$, we obtain the probability density for R,
namely,

(8.21.4)  $g(R) = N(N - 1) \int_{-\infty}^{\infty} [F(x_1 + R) - F(x_1)]^{N-2} f(x_1) f(x_1 + R)\,dx_1$

The distribution function for R is

(8.21.5)  $G(R) = \int_0^R g(u)\,du.$

If we write $F(x_1 + u) - F(x_1) = y$, then, for a fixed $x_1$, $dy = f(x_1 + u)\,du$.
Substituting from Eq. (4) in Eq. (5) and reversing the order of integration
(which is legitimate here), we obtain

(8.21.6)  $G(R) = N(N - 1) \int_{-\infty}^{\infty} f(x_1) \int_{u=0}^{u=R} y^{N-2}\,dy\,dx_1 = N \int_{-\infty}^{\infty} f(x_1)[F(x_1 + R) - F(x_1)]^{N-1}\,dx_1$


The expected value of R, E(R), has been calculated by Tippett (see, e.g.,
[12], page 338) as

(8.21.7)  $E(R) = \int_{-\infty}^{\infty} [1 - F^N - (1 - F)^N]\,dx$

where F stands for F(x). For a standard normal parent population ($\mu = 0$,
$\sigma = 1$), $F(x) = \Phi(x) = (2\pi)^{-1/2} \int_{-\infty}^{x} e^{-u^2/2}\,du$. If the range in a sample from
this population is denoted by w, Tippett's values of E(w) for N = 2(1)500(10)1000
are given in Biometrika Tables for Statisticians, Vol. I, Table 27. Examination
of this table shows that for N between 350 and 550 the value of E(w) is
close to 6. This is the reason for the common practice of estimating roughly the
standard deviation from a sample of several hundred items as one-sixth of the
range.

Values of G(w) for the standard normal population have been calculated for
N = 2 to 20 by E. S. Pearson and H. O. Hartley and may be found in the
Biometrika Tables, Table 23. More complete tables are given in reference [13].
The expression for G(w) may be reduced to

(8.21.8)  $G(w) = [2\Phi(w/2) - 1]^N + 2N \int_{w/2}^{\infty} [\Phi(u) - \Phi(u - w)]^{N-1} \phi(u)\,du$

but the evaluation must be carried out by numerical methods. The distribution
does not approach normality as N increases.

In  practice  the  range  is  used  mainly  for  small  samples,  of  size  5  or  10,  say, 
such  as  commonly  occur  in  the  applications  of  quality  control  in  industry.  The 
range  is  certainly  a  very  convenient  measure  of  dispersion  because  of  the 
simplicity of its calculation. For large samples the distribution becomes very
sensitive  to  departures  from  normality  in  the  parent  population,  particularly  as 
regards  kurtosis. 

* 8.22  Tests of Hypotheses concerning the Variance of a Normal Population by
use of the Sample Range  By comparing the observed range with the tabulated
value of E(w) for a standard normal population, we may estimate the
standard deviation of the actual normal population. For samples of size < 10
the efficiency of this method is at least 85%, and the calculation is very easy.
If the observed range is R, an unbiased estimator of $\sigma$ is given by

(8.22.1)  $\hat{\sigma} = R/E(w) = kR$

The values of k for a few sample sizes are given in the following table, which
also includes upper and lower 0.5 percentage points for $w = R/\sigma$. These may be
used in establishing 99% confidence limits for $\sigma$ based on the range of a single
sample.

Table 8.5 (a)

  N        k       Lower 0.5%    Upper 0.5%
  2      0.886        0.01          3.97
  3       .591         .13          4.42
  4       .486         .34          4.69
  5       .430         .55          4.89
  6       .395         .75          5.03
  7       .370         .92          5.15
  8       .351        1.08          5.26
  9       .337        1.21          5.34
 10       .325        1.33          5.42
 15       .288        1.80          5.70
 20       .268        2.12          5.89

(a) Extracted from Table 22, reference [8], by kind permission of Professor E. S.
Pearson and the publishers of Biometrika.

Thus if the observed range in a sample of five items is R = 8, the estimate of
$\sigma$ would be 8(0.430) = 3.44. The 99% confidence limits for $\sigma$ would be 8/(4.89)
and 8/(0.55), that is, 1.64 and 14.5.
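The factor k is the reciprocal of E(w), which can be obtained by numerical integration of Eq. (8.21.7) for a standard normal parent. The sketch below uses the trapezoidal rule over a finite interval; the integration limits, step count and function name are assumptions made for illustration.

```python
from statistics import NormalDist


def expected_range(N, lo=-10.0, hi=10.0, steps=4000):
    """E(w): integral of 1 - F^N - (1 - F)^N for a standard normal parent."""
    cdf = NormalDist().cdf
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        F = cdf(lo + i * h)
        term = 1.0 - F ** N - (1.0 - F) ** N
        # Trapezoidal rule: end points carry half weight.
        total += term if 0 < i < steps else term / 2
    return total * h


k5 = 1 / expected_range(5)
sigma_hat = 8 * k5                 # observed range R = 8 in a sample of five
limits = (8 / 4.89, 8 / 0.55)      # 99% limits from the tabulated 0.5% points
print(round(k5, 3), round(sigma_hat, 2), [round(v, 2) for v in limits])
```

The same function gives the other k values of Table 8.5, for example `1 / expected_range(2)` for samples of two.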

If we wish to test the hypothesis $H_0$ (that $\sigma = 1$) against the alternative $H_1$
(that $\sigma > 1$) and if we use a test of size $\alpha$, the critical value of w will be the upper
100$\alpha$% point. Thus for $\alpha = 0.005$, the critical value for a sample of size 10 is
5.42 (see Table 8.5). If $\alpha = 0.05$ we find from a larger table that the critical
value is 4.47. The following sample of random normal numbers,

-2.015, -0.623, -0.699, 0.481, -0.586,
-0.579, -0.120, 0.191, 0.071, -3.001

has a range w = 3.482, and hence the hypothesis $H_0$ would not be rejected by
this test. The test is less powerful than that based on the sample variance (using
a table of chi-square) but is easier to apply.
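When a larger table is not at hand, the distribution function of the range can be evaluated numerically from Eq. (8.21.6). The following sketch does this for a standard normal parent by the trapezoidal rule; the limits of integration, step count and names are illustrative assumptions.

```python
from statistics import NormalDist


def range_cdf(w, N, lo=-10.0, hi=10.0, steps=4000):
    """G(w) = N * integral of f(x) [F(x + w) - F(x)]^(N-1) dx, standard normal."""
    nd = NormalDist()
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        term = nd.pdf(x) * (nd.cdf(x + w) - nd.cdf(x)) ** (N - 1)
        total += term if 0 < i < steps else term / 2
    return N * total * h


sample = [-2.015, -0.623, -0.699, 0.481, -0.586,
          -0.579, -0.120, 0.191, 0.071, -3.001]
w = max(sample) - min(sample)
p_exceed = 1 - range_cdf(w, len(sample))   # chance of a range at least this large
print(round(w, 3), round(p_exceed, 3))
print(round(range_cdf(4.47, 10), 3))       # should be close to 0.95
```

The probability of exceeding the observed range is well above 0.05, so the numerical result agrees with the decision reached from the tabulated critical value.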

* 8.23  The Range in Samples from a Rectangular Distribution  If the parent
population is rectangular, so that (with suitable units) F(x) = x, 0 < x < 1, the
distribution function for the range can easily be obtained explicitly. We have

(8.23.1)  $G(R) = N \int_0^1 [F(x_1 + R) - F(x_1)]^{N-1}\,dx_1$

Now $F(x_1 + R) = x_1 + R$ as long as $x_1 + R < 1$, but, when $x_1 > 1 - R$,
$F(x_1 + R)$ remains equal to 1. The region of integration for a given R must
therefore be split into two parts, from 0 to $1 - R$ and from $1 - R$ to 1 (see
Figure 43). Then,

(8.23.2)  $G(R) = N \int_0^{1-R} (x_1 + R - x_1)^{N-1}\,dx_1 + N \int_{1-R}^{1} (1 - x_1)^{N-1}\,dx_1$
          $= NR^{N-1}(1 - R) + R^N = NR^{N-1} - (N - 1)R^N$

Fig. 43  Distribution function for rectangular distribution

The probability density is

(8.23.3)  $g(R) = N(N - 1)R^{N-2}(1 - R)$

which has a maximum at $R = (N - 2)/(N - 1)$. The expected value of R is
$(N - 1)/(N + 1)$.
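The maximum and the expected value quoted for the density of Eq. (8.23.3) follow in a few lines. A short worked derivation, using only that density:

```latex
\begin{align*}
g'(R) &= N(N-1)\left[(N-2)R^{N-3} - (N-1)R^{N-2}\right] = 0
  \quad\Longrightarrow\quad R = \frac{N-2}{N-1}, \\[4pt]
E(R) &= \int_0^1 R\,g(R)\,dR
  = N(N-1)\int_0^1 \left(R^{N-1} - R^{N}\right) dR \\
  &= N(N-1)\left[\frac{1}{N} - \frac{1}{N+1}\right]
  = \frac{N-1}{N+1}.
\end{align*}
```

For N = 5, for instance, the most probable range is 3/4 of the width of the population and the expected range is 2/3 of it.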

A  rectangular  population  is  not  quite  as  artificial  as  it  may  appear.  In  the 
production  of  machine  parts  in  a  factory  to  rather  narrow  specification  limits, 
when  only  those  articles  which  comply  with  the  specification  are  included 
in  the  population,  the  hypothesis  of  a  rectangular  distribution  seems  not 
unreasonable. 

* 8.24  A Test for Equality of Two Rectangular Populations, Based on the Range

If $R_1$ and $R_2$ are the ranges in two random samples of sizes $N_1$ and $N_2$, assumed
to come from rectangular populations of widths $C_1$ and $C_2$, respectively, the
distribution of the quotient of ranges $R_1/R_2$, under the null hypothesis that
$C_1 = C_2 = C$, has been worked out by Rider [14].

The probability density for $u = R_1/R_2$ turns out to be independent of C and
is given by

(8.24.1)  $f(u) = \dfrac{N_1N_2(N_1 - 1)(N_2 - 1)\,[(N_1 + N_2)u^{N_1-2} - (N_1 + N_2 - 2)u^{N_1-1}]}{(N_1 + N_2)(N_1 + N_2 - 1)(N_1 + N_2 - 2)}, \qquad 0 \le u \le 1$

and

(8.24.2)  $f(u) = \dfrac{N_1N_2(N_1 - 1)(N_2 - 1)\,[(N_1 + N_2)u^{-N_2} - (N_1 + N_2 - 2)u^{-N_2-1}]}{(N_1 + N_2)(N_1 + N_2 - 1)(N_1 + N_2 - 2)}, \qquad 1 \le u < \infty$

The expected value of u is $(N_1 - 1)N_2/[(N_1 + 1)(N_2 - 2)]$.

Rider gives a table for the quotient of ranges which will be exceeded in 5%
of random samples, and this table may be used for testing the null hypothesis.

Example 8  The width of a slot in a certain airplane part was measured to
the thousandth of an inch in a sample of five parts on the first day of production
and again in a sample of 10 parts two days later. The results (in thousandths of
an inch in excess of 0.800 in.) were

(1)  77, 80, 78, 72, 78

(2)  75, 77, 75, 76, 77, 79, 75, 78, 77, 76

We see that $N_1 = 5$, $R_1 = 8$, $N_2 = 10$, $R_2 = 4$. Then u = 2. The probability
of a value as great as this can be obtained by integrating Eq. (2) from u = 2
to $\infty$ and is 0.0013. It appears, therefore, that the second sample is pretty
definitely more uniform than the first. The 5% critical value of u is actually 1.27.
This test is analogous to the F-test, discussed in §§ 8.15 and 8.16. The null
hypothesis that the quotient of population ranges is 1 is tested against the
alternative hypothesis that the quotient is greater than 1. We can always make
u > 1 by choosing for sample (1) that with the greater range.
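The tail probability in this example has a closed form, obtained by integrating the branch of Rider's density that applies for u of 1 or more, namely a constant times $(N_1 + N_2)u^{-N_2} - (N_1 + N_2 - 2)u^{-N_2-1}$, term by term from u to infinity. The sketch below is an illustration; the function name is an assumption.

```python
def range_ratio_upper_tail(u, N1, N2):
    """Pr(R1/R2 > u) for u >= 1 under equal rectangular populations."""
    if u < 1:
        raise ValueError("this closed form applies only for u >= 1")
    M = N1 + N2
    c = N1 * N2 * (N1 - 1) * (N2 - 1) / (M * (M - 1) * (M - 2))
    # Integral of M*v^(-N2) - (M-2)*v^(-N2-1) from u to infinity.
    return c * (M * u ** (1 - N2) / (N2 - 1) - (M - 2) * u ** (-N2) / N2)


day1 = [77, 80, 78, 72, 78]
day2 = [75, 77, 75, 76, 77, 79, 75, 78, 77, 76]
u = (max(day1) - min(day1)) / (max(day2) - min(day2))
p = range_ratio_upper_tail(u, len(day1), len(day2))
print(u, round(p, 4))
print(round(range_ratio_upper_tail(1.27, 5, 10), 3))   # near 0.05
```

Evaluating the same function at u = 1.27 gives a probability close to 0.05, consistent with the 5% critical value quoted in the example.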

8.25  The Distribution of Order Statistics  The rth order-statistic of a sample
of size N is the rth smallest variate-value in the sample. If the values are arranged
in ascending order of size, $x_1 < x_2 < x_3 \ldots < x_N$, the rth order-statistic is $x_r$.
For a sample of size 2r + 1, the (r + 1)th order-statistic is the median (the
middle value).

In a sample of size N from a population with a continuous distribution
function F(x), the probability that $x_r = x$ (to x + dx) is

(8.25.1)  $g(x)\,dx = C[F(x)]^{r-1}[1 - F(x)]^{N-r} f(x)\,dx$

since there are $r - 1$ observations smaller than x and $N - r$ observations
larger than x. The constant C is found from the condition

(8.25.2)  $\int_{-\infty}^{\infty} g(x)\,dx = 1$

Writing F(x) = F, f(x) dx = dF, and noting that F goes from 0 to 1 as x goes
from $-\infty$ to $\infty$, we see that Eq. (2) gives

(8.25.3)  $C \int_0^1 F^{r-1}(1 - F)^{N-r}\,dF = 1$

whence

(8.25.4)  $C^{-1} = B(r,\,N - r + 1)$

For a rectangular parent population, F(x) = x and g(x) is simply the ordinary
beta-distribution.

For a normal population, the study of the distribution involves the numerical
calculation of certain integrals. Thus for the median, with N = 2r + 1, and
with $F(x) = \Phi(x)$, we have

(8.25.5)  $g(x)\,dx = C[\Phi(x)]^r [1 - \Phi(x)]^r\,d\Phi(x)$

where

(8.25.6)  $C^{-1} = B(r + 1,\,r + 1) = (r!)^2/N!$

Note that the r of Eq. (1) is now replaced by r + 1, since the median is $x_{r+1}$
when N = 2r + 1. For N = 2r, the median is taken as $(x_r + x_{r+1})/2$.

It is obvious from the symmetry of the parent population (which we have
taken as standardized) that the expected value of the median will be zero. The
variance is

(8.25.7)  $V(x_{r+1}) = \int_{-\infty}^{\infty} x^2 g(x)\,dx = \dfrac{N!}{(r!)^2} \int_{-\infty}^{\infty} x^2\,\Phi^r (1 - \Phi)^r \phi(x)\,dx$

where $\phi(x) = (2\pi)^{-1/2} e^{-x^2/2}$, and $\Phi$ is written for $\Phi(x)$. If $(1 - \Phi)^r$ is expanded
binomially as

(8.25.8)  $(1 - \Phi)^r = \sum_{j=0}^{r} (-1)^j \binom{r}{j} \Phi^j$

integration by parts, with $x e^{-x^2/2}$ as one part, yields the result

(8.25.9)  $V(x_{r+1}) = 1 + \dfrac{N!}{4\pi (r!)^2} \sum_{j=0}^{r} (-1)^j \binom{r}{j} (r + j)(r + j - 1)\,T_{r+j-2}$

where

(8.25.10)  $T_{r+j-2} = \int_{-\infty}^{\infty} \Phi^{r+j-2} (2\pi)^{-1/2} \exp(-3x^2/2)\,dx$
By calculating the different integrals of this type, the variance, and also
higher moments, may be obtained. An approximate expression for the variance
of the median in odd samples from a standard normal population is

(8.25.11)  nxf+1)=^^+4(JV+2"2(JV+4) 

For a sample of size 11 this gives 0.133, the correct value being 0.137.

The corresponding variance of the mean is 0.091, so that the efficiency of the
median in a sample of this size is 66.4%. As N increases, the efficiency tends to
the value $2/\pi = 0.637$. On the average, therefore, we can get about as good a
value of the population mean from the median of 100 observations as from the
mean of 64.
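The exact variance of the median can be checked by direct numerical integration of the second moment of the median's density, with constant $N!/(r!)^2$, rather than through the series. The sketch below uses the trapezoidal rule for a standard normal parent; the integration limits, step count and function name are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def median_variance(N, lo=-8.0, hi=8.0, steps=4000):
    """Variance of the median of an odd sample N = 2r + 1, standard normal parent."""
    r = (N - 1) // 2
    nd = NormalDist()
    c = math.factorial(N) / math.factorial(r) ** 2   # C = N!/(r!)^2
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        P = nd.cdf(x)
        term = x * x * P ** r * (1 - P) ** r * nd.pdf(x)
        total += term if 0 < i < steps else term / 2
    return c * total * h


v = median_variance(11)
efficiency = (1 / 11) / v          # variance of the mean divided by that of the median
print(round(v, 3), round(efficiency, 3), round(2 / math.pi, 3))
```

The computed efficiency for N = 11 lies a little above the limiting value $2/\pi$, in line with the figures quoted in the text.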

If $\mu$ is the population median, for a population with density f(x), so that

(8.25.12)  $\int_{-\infty}^{\mu} f(x)\,dx = 1/2$

then as $N \to \infty$ the sample median is approximately normally distributed with
mean $\mu$ and variance $[4(N + 2)f^2(\mu)]^{-1}$.

For the rectangular distribution, for which $f(\mu) = 1$, this value is exact. For
the normal distribution, $f(\mu) = (2\pi)^{-1/2}$, and the approximation is $\pi/[2(N + 2)]$.

* 8.26  The Asymptotic Distribution of the Extreme Value  The distribution
of the largest value in a sample is of interest in certain applications, as for
instance in predicting the occurrence of exceptional floods in river flow. If in
Eq. (8.25.1) we put r = N, the probability density for the largest value is found,
as in § 8.19, to be

(8.26.1)  $g(x)\,dx = N[F(x)]^{N-1} f(x)\,dx$

and the distribution function is

(8.26.2)  $G(x) = [F(x)]^N$

Let $x_0$ be defined by the relation

(8.26.3)  $N[1 - F(x_0)] = 1$

Since $N[1 - F(x_0)]$ is the expected number of values exceeding $x_0$ in a
sample of size N, Eq. (3) states that in such a sample we may expect $x_0$ to be
exceeded just once.

We will first suppose that the distribution in the parent population is
exponential, so that

$F(x) = 1 - e^{-\alpha x}, \qquad f(x) = \alpha e^{-\alpha x}, \qquad \text{and} \quad e^{\alpha x_0} = N$

Therefore,

$[F(x)]^N = [1 - e^{-\alpha x}]^N = \left[1 - \dfrac{1}{N}\,e^{-\alpha(x - x_0)}\right]^N$

and the limit of this as $N \to \infty$ is $\exp[-e^{-\alpha(x - x_0)}]$. We have as the limiting
distribution, therefore,

(8.26.4)  $\lim G(x) = \exp[-e^{-\alpha(x - x_0)}]$

and, consequently,

(8.26.5)  $\lim g(x) = \alpha e^{-\alpha(x - x_0)} \exp[-e^{-\alpha(x - x_0)}]$

If $y = \alpha(x - x_0)$ the density function for y is

(8.26.6)  $h(y) = e^{-y} \exp(-e^{-y})$

which is a form used by Gumbel in a study of floods [15].
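The approach to the limiting form can be seen numerically by comparing the exact distribution function of the largest value from an exponential parent with the double-exponential limit. In the sketch below the parameter `a` plays the role of the exponential rate, and the function names are illustrative assumptions.

```python
import math


def largest_cdf_exponential(x, N, a=1.0):
    """Exact Pr(largest of N exponential values < x), i.e. [F(x)]^N."""
    return (1 - math.exp(-a * x)) ** N


def gumbel_limit(x, N, a=1.0):
    """Limiting form exp[-exp(-a (x - x0))], with x0 from exp(a x0) = N."""
    x0 = math.log(N) / a
    return math.exp(-math.exp(-a * (x - x0)))


N = 1000
x0 = math.log(N)
for y in (-1.0, 0.0, 1.0, 2.0):
    x = x0 + y
    print(y, round(largest_cdf_exponential(x, N), 5), round(gumbel_limit(x, N), 5))
```

Already for N = 1000 the two columns agree to about three decimal places over the range of y shown.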
For the normal distribution,

$f(x) = (2\pi)^{-1/2} e^{-x^2/2}$

and

(8.26.7)  $1 - F(x_0) = (2\pi)^{-1/2} \int_{x_0}^{\infty} e^{-u^2/2}\,du$

The integral on the right of Eq. (7) is asymptotically equivalent to
$f(x_0)(1/x_0 - 1/x_0^3 + \ldots)$, so that from Eq. (3)

$\dfrac{1}{N} = (2\pi)^{-1/2} e^{-x_0^2/2} \left(\dfrac{1}{x_0} - \dfrac{1}{x_0^3} + \ldots\right)$

or

$-\tfrac{1}{2}x_0^2 \approx \log x_0 + \tfrac{1}{2}\log(2\pi) - \log N$

Using only the leading terms for large N,

(8.26.8)  $x_0^2 \approx 2 \log N$

In the exponential distribution of Eq. (4), $\alpha^{-1} = [1 - F(x_0)]/f(x_0)$. The
corresponding expression for the normal law is asymptotically equal to $x_0^{-1}$,
and in fact, as proved by Fisher and Tippett [16] the limiting distribution
of the extreme value is the same as that of Eq. (4) with $\alpha = x_0$, where $x_0$ is
given by (8).
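The leading-term approximation for $x_0$ is a slow one, which can be seen by solving $N[1 - F(x_0)] = 1$ exactly for a standard normal parent and comparing with $(2 \log N)^{1/2}$. A brief sketch, with an illustrative function name:

```python
import math
from statistics import NormalDist


def characteristic_largest_value(N):
    """x0 solving N [1 - F(x0)] = 1 for a standard normal parent."""
    return NormalDist().inv_cdf(1 - 1 / N)


for N in (100, 10_000, 1_000_000):
    exact = characteristic_largest_value(N)
    leading = math.sqrt(2 * math.log(N))
    print(N, round(exact, 3), round(leading, 3))
```

The leading term overstates $x_0$ at all practical sample sizes, because the neglected term $\log x_0 + \tfrac{1}{2}\log(2\pi)$ is not small compared with $\log N$ unless N is extremely large.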

8.27  The Effect of Non-Normality  Since we do not usually know whether a
sample comes from a normal universe or not, it is natural to ask what difference
it would make in the t-test or the F-test if the universe were not normal. Bartlett
[17] and others have shown that the t-test gives quite good results even for
considerable departures from normality, although the one-tailed test is more
vulnerable in this respect than the two-tailed test. For a skew parent population
the true significance level may be considerably under- or overestimated by using
the ordinary tables. The effects of skewness and kurtosis in the parent population
on the power function of the t-test have been discussed by Srivastava [18]. A
positive skewness tends to reduce the power when the power is low and increase
it when it is high; a negative skewness has the opposite effect. Kurtosis, unless
quite marked, seems to have comparatively little effect. With increase in sample
size, of course, the effect of non-normality diminishes.

Some experimental work on the F-test for samples from non-normal populations
was carried out by E. S. Pearson, and this suggested that the test may be
used with populations differing quite considerably from normal without serious
error. W. G. Cochran [19] has discussed the effect of non-normality on the t-test
and F-test, and concludes that a tabular 5% may perhaps mean anything from 4%
to 7% and a tabular 1% anything between ½% and 2%. The effect of non-normality
is usually to increase the apparent significance of results, which
suggests caution when interpreting results near the borderline of significance.

Unless  data  are  very  extensive,  it  is  seldom  possible  to  demonstrate  that  they 
are  not  normal.  The  standard  errors  of  skewness  and  kurtosis  are  so  large  with 
samples  of  moderate  size  that  only  very  marked  non-normality  could  be  detected. 
If  there  is  reason  to  suspect  non-normality,  from  the  nature  of  the  data,  it  is 
advisable  to  try  a  transformation.  The  logarithm  of  the  variate,  or  the  square 
root,  or  the  inverse  sine,  may  be  more  nearly  normal  (see  Chapters  3  and  4, 
§§3.15,  3.16  and  4.8,  and  also  reference  [20]). 


PROBLEMS 

A.  (§§  8.1-8.4) 

1.  If  X  is  normally  distributed  with  mean  0  and  variance  1,  and  if  m  is  the  mean  of 
a  random  sample  of  16  items,  show  that  the  odds  are  about  370  to  1  against  obtaining 
an  m  numerically  greater  than  f. 

2.  Assume  that  the  mean  age  at  death  of  men  who  are  alive  at  age  20  is  59.13  years, 
with  a  standard  deviation  of  10.2  years.  An  insurance  company  would  like  to  feel 
fairly  sure  (probability  at  least  0.99)  that  the  mean  age  at  death  in  its  own  group  of 
men aged 20 will not differ from 59.13 years by more than 1 year. Assuming a normal
distribution,  how  large  should  the  group  be  ? 

3.  The  mean  of  a  particular  normal  distribution  is  equal  to  the  standard  deviation 
of  the  mean  of  samples  of  100  from  the  same  distribution.  Find  the  probability  that  the 
mean  of  a  sample  of  25  will  be  negative. 

4.  How  large  a  sample  should  be  taken  from  a  normal  population  if  the  probability 
is  to  be  0.95  that  the  sample  mean  will  not  differ  from  the  population  mean  by  more 
than  one-quarter  of  the  population  standard  deviation? 

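5. Prove that the density function of the statistic $k_2$, for given $\sigma$, has a
maximum at $k_2 = (n - 2)\sigma^2/n$, where $n = N - 1$. (Hence the estimator
$nk_2/(n - 2)$ has the property that its most likely value is the true value $\sigma^2$.)

6. Show that the moment generating function of the distribution of $k_2$ is
$M(h) = (1 - 2h\sigma^2/n)^{-n/2}$. Hence obtain the c.g.f., and the expectation and
variance of $k_2$.

7. For what value of $a$ is the expectation of $(ak_2 - \sigma^2)^2$ a minimum? (The quantity
$ak_2$ is a "least squares" estimator of $\sigma^2$, different from the unbiased estimator and the
one in Problem 5.)

8. If $s$ is the positive square root of $k_2$, prove that

$E(s) = \sigma \left(\frac{2}{n}\right)^{1/2} \frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}$

and

$V(s) = \sigma^2 \left[1 - \frac{2}{n}\,\frac{\Gamma^2\left(\frac{n+1}{2}\right)}{\Gamma^2\left(\frac{n}{2}\right)}\right]$

Hence show that

$E(s) = \sigma\left(1 - \frac{1}{4n} + O(n^{-2})\right)$

and

$V(s) = \frac{\sigma^2}{2n}\left(1 - \frac{1}{4n} + O(n^{-2})\right)$

Hint: Use the Stirling formula,

$\log \Gamma(x) = \frac{1}{2}\log 2\pi + \left(x - \frac{1}{2}\right)\log x - x + \frac{1}{12x}$

to evaluate $\log\left[\Gamma\left(\frac{n+1}{2}\right)/\Gamma\left(\frac{n}{2}\right)\right]$. See (8.6.4).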


B.  (§§8.5-8.13) 

1.  Four  different  boxes  of  Eddy's  matches,  from  the  same  carton,  contained  55,  58, 
53  and  57  matches.  Obtain  95  %  confidence  limits  for  the  mean  number  of  matches  in 
boxes  of  the  same  kind. 

2.  The  tensile  strength  (X),  in  pounds,  of  a  certain  type  of  cable  was  measured  for 
12  samples.  The  results  were :  182,  178,  185,  184,  180,  179,  177,  185,  174,  179,  183,  186. 
Calculate  90%  confidence  limits  for  the  mean  of  X  in  this  type  of  cable.  Hint:  Use  the 
auxiliary  variate  U  =  X  —  180. 

3.  A  machine  producing  mica  insulating  washers  is  supposed  to  turn  them  out  with 
a  mean  thickness  of  10  mils  (1  mil  =  0.001  in.).  A  random  sample  of  nine  washers 
from  the  output  of  this  machine  has  a  mean  thickness  9.5  mils  with  a  standard  deviation 
0.60  mil.  Is  the  output  significantly  different  from  standard  with  respect  to  thickness  ? 

4.  In  the  course  of  archaeological  investigations  at  a  certain  site,  16  lower  first 
molars  were  found  with  mean  length  13.57  mm  and  standard  deviation  0.72  mm.  From 
a  near-by  site,  nine  lower  first  molars  were  taken  with  mean  13.06  mm  and  standard 
deviation  0.62  mm.  Can  the  two  finds  be  reasonably  regarded  as  samples  of  the  same 
population? 

5.  Two  samples  of  herring  were  measured  for  length  (mm)  with  the  following 
results : 

(1)  192,  179,  181,  193,  215,  181,  178 

(2)  173,  194,  194,  187,  168,  186,  176,  191,  191,  178,  185,  160 

Find  95%  confidence  limits  for  the  difference  in  the  mean  lengths  for  the  two 
populations  sampled. 

6.  Twelve  hogs  were  fed  on  diet  A,  fifteen  others  on  diet  B.  The  gains  in  weight 
for  the  individual  hogs  in  pounds,  over  the  same  period,  were  as  follows : 

A  25,  30,  28,  34,  24,  35,  13,  32,  24,  30,  31,  35. 

B 44, 34, 22, 18, 47, 31, 40, 30, 32, 35, 18, 21, 35, 29, 22.

On the assumption that the diet may affect the mean gain without affecting the
variance of gains, obtain 90% confidence limits for the average increase in gain with
diet B over that with diet A.

7.  A  physiological  experiment  was  carried  out  to  test  the  effect  of  an  injection  of 
secretin  on  the  percentage  of  reticulocytes  in  the  blood  of  rabbits.  Seventeen  rabbits 
were  tested,  before  and  after  injection,  and  the  mean  increase  was  0.0635.  The  standard 
deviation  of  the  increases  was  0.168.  Was  there  a  significant  effect  at  the  5%  level? 

8.  A  paired  feeding  experiment  on  pigs  was  conducted  to  determine  the  relative 
value  of  limestone  and  bone  meal  for  bone  development.  The  variable  is  the  percentage 
ash  content  in  the  shoulder-blade. 


Pair Number    Limestone    Bone Meal
     1           49.2         57.5
     2           53.3         54.9
     3           50.6         52.2
     4           52.0         53.3
     5           46.8         57.6
     6           50.5         54.1
     7           52.1         54.2
     8           53.0         54.3

Is  the  difference  between  the  two  sets  of  values  significant  at  the  5  %  level  ? 

9.  (Snedecor)  An  agronomist,  interested  in  the  effect  of  superphosphate  on  the  yield 
of  corn,  added  the  fertilizer  to  a  mixture  of  manure  and  lime.  Five  pairs  of  adjacent 
plots  were  used  for  the  trial,  the  plots  in  each  pair  being  as  alike  as  possible  except  that 
one  was  treated  with  the  old  fertilizer  (without  superphosphate)  and  the  other  with  the 
new.  The  plots  with  the  new  fertilizer  yielded,  respectively,  20,  6,  4,  3  and  2  bushels 
per  acre  more  than  the  corresponding  controls.  Was  the  value  of  the  superphosphate 
demonstrated  ? 

If  the  increased  yields  had  been  5,  6,  4,  3  and  2  bushels  per  acre,  would  the  verdict 
have  been  different  ?  Explain  the  apparent  paradox. 

10. Complete the proof of the statement in § 8.5 that Student's t-distribution tends
to normal as $n \to \infty$. Hint: See Eq. (8.6.4).

11. If $t = n^{1/2} \cot \phi$, show that the density function for $\phi$ is
$C \sin^{n-1} \phi$ $(0 < \phi < \pi)$, where $C = 1/B(1/2, n/2)$.

12. A sample of size 20 is used for testing the hypothesis that $\mu = 0$ against the
alternative hypothesis that $\mu = 0.5\sigma$, where $\sigma$ is the population standard deviation.
If Student's t is used as the criterion, and the size of the test is 0.05, what is the power?
Hint: Use the approximation of Eq. (8.12.5).

13. It is desired to test the hypothesis that $\mu = 0$ against the alternative hypothesis
that $\mu = \mu_1$ $(\mu_1 > 0)$. If the standard deviation of the population is 10 units and a
sample of size 17 is used, find the least value of $\mu_1$ that could be detected by the t-test,
assuming that the risks for both kinds of error are not more than 0.05.

14. Two samples, each of size 10, come from populations with means $\mu_1$ and $\mu_2$ and
a common variance $\sigma^2$. The null hypothesis $H_0$ is that $\mu_2 - \mu_1 \le 0$ and the alternative
$H_1$ is that $\mu_2 - \mu_1 = k\sigma$ $(k > 0)$. Show that if $\alpha$ is not greater than 0.05, a value of $k$
at least 1.37 could be detected by the t-test, with power at least 0.9. Hint: If the two
samples have means $m_1$ and $m_2$ and variances $s_1^2$ and $s_2^2$, the quantity
$(m_2 - m_1)/\{(s_1^2 + s_2^2)/10\}^{1/2}$ has the t distribution with 18 d.f. if $\mu_1 = \mu_2$.
Under $H_1$ it has a non-central t-distribution with parameter $k\sqrt{5}$, since the variance
of $m_1 - m_2$ is $\sigma^2/5$.

15. A large lot of manufactured articles is rated on the percentage $p$ of defective
items, an item being reckoned defective if the value of a normal variate $X$ is at least 3.
A prospective purchaser will want to reject a lot with probability 0.95 if $p > 2.5$. A
sample of 10 items is used for a non-central t-test. What criterion should be used for
accepting the lot? Hint: Find $\delta$ from Eq. (8.13.3); $k$ can be obtained from the tables
of non-central t, using Eq. (8.13.5) with $P = 0.05$, or approximately from Eq. (8.12.5)
with $z_P = 1.645$ and $t_m = k\sqrt{10}$.

C  (§§  8.14-8.17) 

1.  The  variance  of  a  random  sample  of  size  5  is  29.83.  Calculate  90%  confidence 
limits  for  the  population  variance. 

2. Two chemists, A and B, each repeat a protein analysis 20 times. If the sets of
values obtained are denoted by $X_i$, $Y_i$, respectively, it is found that $\sum X_i$ = 196.40,
$\sum X_i^2$ = 1928.6560, $\sum Y_i$ = 205.16, $\sum Y_i^2$ = 2104.7152. Determine whether there is a
significant difference in precision between the two sets of analyses, precision being
inversely proportional to the variance.

3.  In  two  series  of  hauls  to  determine  the  number  of  plankton  organisms  inhabiting 
the  waters  of  a  lake,  the  following  results  were  found : 

Series  I:  80,  96,  102,  77,  97,  110,  99,  88,  103,  108 

Series  II:  74,  122,  92,  81,  104,  92,  90. 
In  Series  I  the  hauls  were  made  in  succession  at  the  same  place;  in  Series  II  they  were 
made  at  different  points  scattered  over  the  lake.   Does  there  appear  to  be  a  greater 
variability  between  different  places  than  exists  at  different  times  at  the  same  place? 

4. For the data on feeding of hogs in Problem B-6, determine whether the assumption
of a common variance under both diets is justified.

5. If $x = n_2/(n_2 + n_1 F)$, prove that $x$ is a beta-variate with parameters $n_2/2$, $n_1/2$.

6. When $n_1 = 2$, show that the upper significance level of $F$ corresponding to
probability $p$ is $n_2(p^{-2/n_2} - 1)/2$. Hint: The integral of $g(F)$ from $F_1$ to $\infty$ is $p$, where
$F_1$ is the required level.

7. Find the upper 5% point for $F$ with 2 and 4 degrees of freedom by direct integration
of $g(F)$. Compare with the value in Table B.5 in the appendix.

8. What is the smallest ratio $\lambda$ of two variances $(\lambda > 1)$ that can be detected by an
F-test with two samples of size 10, the size of the test being 0.05 and the power 0.95?
Hint: When $n_1 = n_2$, the 95% point for $F$ is the reciprocal of the 5% point.

9. An approximation to Fisher's $z$ for given $P$, $n_1$ and $n_2$ has been devised by A. H.
Carter, namely,

$z_F = z_P \frac{(h + k)^{1/2}}{h} - \left(\frac{1}{n_1} - \frac{1}{n_2}\right)\left(k + \frac{5}{6} - \frac{s}{3}\right)$

where $s = 1/n_1 + 1/n_2$, $h = 2/s$, $k = (z_P^2 - 3)/6$ and $z_P$ is the normal standard variate
exceeded with probability $P$. Use this approximation for $n_1 = n_2 = 19$ to find $z_F$, and
hence $F$, when $P = 0.25$. Then determine the smallest variance ratio detectable with
two samples of size 20 and a test of size 0.05 and power 0.25. Hint: $z_F = \frac{1}{2} \log_e F$.

D  (§§  8.18-8.26) 

1.  A  sample  of  10  observations  is  taken  from  a  normal  population  with  mean  250  and 
standard  deviation  10.  What  value  for  the  largest  member  of  the  sample  would  be 
exceeded  only  once  in  20  samples;  what  value  only  once  in  100  samples? 

2. A quantity is measured 10 times with the following results: 236, 251, 249, 252,
248, 254, 246, 257, 243, 274. Should the largest of these observations be rejected
according to Thompson's criterion?

3. In the following measurements of an angle (degrees and minutes omitted, values
in seconds of arc) would it be reasonable to reject the lowest reading? 51.75, 47.85,
47.40, 48.90, 44.45, 48.45, 51.05, 48.85, 50.95, 50.60, 47.75, 49.20, 50.55.

4. Apply Dixon's criterion to the data of Problems 2 and 3.

5. The following frequency distribution was found for the range $R$ in 200 samples
of size 10 from an artificial, approximately normal, population with mean 20 and
standard deviation 4:

R    5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23
f    2   4   4  14  11  20  25  28  25  17  13  13   5   9   3   3   3   0   1

Calculate the mean range and estimate the standard deviation of the population,
assuming that the true value 4 is unknown. Hint: The estimator of $\sigma$ is $k\bar{R}$, where $k$
is found from Table 8.5.

6. Let $u$ denote the mid-range of a sample, that is, $u = (x_1 + x_N)/2$. For the
rectangular population with density $f(x) = \frac{1}{2}$, $-1 < x < 1$, the density for the mid-range
is given by $g(u) = \frac{1}{2}N(1 - |u|)^{N-1}$. Prove that the variance of the mid-range is
$2/[(N + 1)(N + 2)]$. Hence show that the mid-range is, for all $N > 2$, more efficient
than the arithmetic mean as an estimator of the population mean. Hint: Separate the
interval of integration into two parts, $-1$ to 0, and 0 to 1.

7. A sample of size $N$ is taken from the exponential population with density $e^{-x}$,
$x > 0$. Find the density function of the range and show that its expected value is
$1 + \frac{1}{2} + \frac{1}{3} + \ldots + 1/(N - 1)$. Hint: In the integral for $E(R)$ put $u = 1 - e^{-R}$, and
expand $\log (1 - u)$ in a series.

8.  Samples  of  size  4  are  taken  from  the  population  of  Problem  7.  Find  95  % 
confidence  limits  for  the  range  in  such  samples. 

9. For a sample of size $N$ from the rectangular distribution $f(x) = 1/b$, $0 < x < b$,
show that $R/b$ is a beta-variate with parameters $N - 1$ and 2. Hence obtain the mean
and variance of the distribution of $R$.

10. Numbers are drawn at random from the interval (0, 1). How many are required
before the probability will exceed 0.95 that the range of the sample will be at least 0.5?
Hint: Show that $N$ is given by the inequality $2^{N-2} > 5(N + 1)$. Solve by trial for small
values of $N$.

11. A sample of odd size $N$ $(= 2r + 1)$ is taken from the rectangular population
with density 1, $0 < x < 1$. The median is the $(r + 1)$th member of the sample when
arranged in ascending order. Prove that the expectation of the median is $\frac{1}{2}$ and its
variance is $1/[4(N + 2)]$.

12. Prove that the density function for the range $w$ in samples of size 3 from a
standard normal population is $g(w) = (6/\pi^{1/2})\, e^{-w^2/4} [\Phi(w/\sqrt{6}) - \frac{1}{2}]$. Hint: Use Eq.
(8.21.4) with $F(x) = \Phi(x)$, and $w$ for $R$. Put $x_1 = (v - w)/2$ and obtain

$g(w) = 3(2\pi)^{-3/2} e^{-w^2/4} \int_{-\infty}^{\infty} e^{-v^2/4} \int_{(v-w)/2}^{(v+w)/2} e^{-u^2/2}\, du\, dv$

Change to oblique coordinates $x = u - v/2$, $z = v\sqrt{3}/2$ and integrate over the strip
of horizontal width $w$, $x$ going from $-w/2$ to $w/2$ and $z$ from $-\infty$ to $\infty$.

Chapter 9

ANALYSIS  OF  VARIANCE 


9.1  Tests  of  Homogeneity  of  Variance  The  analysis  of  variance  is  a  widely 
used  technique  for  separating  the  observed  variance  in  a  group  of  samples  into 
portions  which  are  traceable  to  different  sources.  Thus  the  different  samples 
may  have  undergone  different  treatments  which  possibly  affect  the  general  level 
of  the  measured  variate  X.  If  all  the  samples  are  lumped  together  into  one  grand 
sample,  the  observed  variance  will  be  partly  due  to  differences  between  the 
individual  members  of  the  same  original  sample  and  partly  due  to  the  effects 
of  the  different  treatments.  The  method  of  analysis  of  variance  enables  us  to 
estimate  how  much  of  the  variance  is  attributable  to  the  one  cause  and  how 
much  to  the  other,  and  so  to  decide  whether  or  not  the  treatments  have  produced 
any  significant  effects. 

Much  more  elaborate  experimental  designs  than  this  can  be  analysed  by 
comparable  methods,  and  several  of  the  more  usual  designs  will  be  considered 
in  this  chapter.  All  the  common  analysis-of-variance  tests  rest  on  certain 
assumptions, such as normality of the distribution of X and additivity of treatment
effects, and among these assumptions is one on the constancy of variance
as  between  samples.  It  is  supposed  that,  apart  from  possibly  affecting  the 
average  value  of  X,  the  different  treatments  (for  instance)  do  not  change  the 
sample  variances.  Methods  which  do  not  depend  on  these  assumptions  will  be 
mentioned  later  on,  but  meanwhile  a  test  for  the  homogeneity  of  variance  as 
between  a  group  of  samples  will  be  considered. 

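For two samples, the technique of § 8.15 may be used. We suppose therefore
that we have $k$ samples $(k > 2)$, and that for the $i$th sample, of size $N_i$, the
observed values are $x_{ij}$, $j = 1, 2, \ldots, N_i$; $i = 1, 2, \ldots, k$. We assume that all the
samples are independent and come from normal populations with means $\mu_1$,
$\mu_2, \ldots, \mu_k$ and variances $\sigma_1^2$, $\sigma_2^2, \ldots, \sigma_k^2$. The null hypothesis $H_0$ is that
$\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$ ($= \sigma^2$, say). The alternative hypothesis $H_1$ is that these
variances are not all equal.

Under $H_0$, the likelihood function is

(9.1.1)  $L_0 = (2\pi\sigma^2)^{-N/2} \exp\left[-\frac{1}{2}\sum_{i,j}\left(\frac{x_{ij} - \mu_i}{\sigma}\right)^2\right]$

where $N = \sum_i N_i$. Under $H_1$, the likelihood function is

(9.1.2)  $L_1 = (2\pi)^{-N/2}\, \sigma_1^{-N_1} \cdots \sigma_k^{-N_k} \exp\left[-\frac{1}{2}\sum_{i,j}\left(\frac{x_{ij} - \mu_i}{\sigma_i}\right)^2\right]$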


The joint maximum likelihood estimators of $\mu_i$ and $\sigma$ under $H_0$ are

(9.1.3)  $\hat{\mu}_i = \bar{x}_i, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_i S_i$

where

(9.1.4)  $S_i = \sum_j (x_{ij} - \bar{x}_i)^2$

which is the sum of squares of deviations from the mean for the $i$th sample. The
sample variance is $v_i = S_i/n_i$, where $n_i = N_i - 1$.

The joint maximum likelihood estimators of $\mu_i$ and $\sigma_i$ under $H_1$ are

(9.1.5)  $\hat{\mu}_i = \bar{x}_i, \qquad \hat{\sigma}_i^2 = S_i/N_i$

The likelihood ratio is, therefore,

(9.1.6)  $L = \frac{(L_0)_{\max}}{(L_1)_{\max}} = \frac{(S_1/N_1)^{N_1/2} \cdots (S_k/N_k)^{N_k/2}}{(S/N)^{N/2}}$

where $S = \sum_i S_i$. Then $H_0$ will be rejected if $L < c$. The constant $c$ is so
chosen that

(9.1.7)  $P(L < c \mid H_0) \le \alpha$

As mentioned in § 6.9, the distribution of $-2 \log L$ for large $N$ is approximately
$\chi^2$ with $k - 1$ degrees of freedom. This number is the number of
parameters under $H_1$, namely, $2k$, less the number under $H_0$, namely, $k + 1$.
In this case,

(9.1.8)  $-2 \log L = N \log \frac{S}{N} - \sum_i N_i \log \frac{S_i}{N_i}$

It was shown by Bartlett [1] that the approximation to $\chi^2$ may be improved
by replacing each $N_i$ by $n_i$ ($= N_i - 1$) and therefore $N$ by $n$ ($= N - k$). In
effect this replaces the maximum likelihood estimators by unbiased estimators.
Furthermore, the approximation will hold reasonably well down to values of $n_i$
as small as 4 or 5 if a correcting factor is introduced in Eq. (8). We can therefore
in most cases assume that the quantity

(9.1.9)  $M = C^{-1}\left[n \log \frac{S}{n} - \sum_i n_i \log \frac{S_i}{n_i}\right]$

is distributed like $\chi^2$ with $k - 1$ d.f., where

(9.1.10)  $C = 1 + \frac{\sum_i (1/n_i) - 1/n}{3(k - 1)}$

A small value of $L$ will mean a large value of $M$, since $M \approx -2 \log L$, and
this will lead to rejection of the null hypothesis. It should be noted that the
logarithms in Eq. (9) are to base $e$.

For even smaller values of $n_i$ (down to 2, say) an improved approximation
was  given  by  Hartley  [2],  and  tables  based  on  this  approximation  have  been 
compiled  by  Catherine  Thompson  and  Maxine  Merrington  [3]. 

Example 1 To test the effect of a small proportion of coal in the sand
used for making concrete, several batches were mixed under practically identical
conditions except for the variation in the percentage of coal. From each
batch, four cylinders were made and tested for breaking strength in lb/in².
One cylinder in the third sample was defective, so there were only three items
in this sample. The results are given in Table 9.1.

Table 9.1

Sample No.                    1        2        3        4        5
Percentage coal               0        0.05     0.1      0.5      1.0

Breaking strengths            1690     1550     1625     1725     1530
                              1580     1445     1450     1550     1545
                              1745     1645     1510     1430     1565
                              1685     1545              1445     1520

Mean                          1675     1546     1528     1538     1540
$S_i/n_i$                     4750     6673     7908     18,475   383
$n_i \log_{10} (S_i/n_i)$     11.03    11.47    7.80     12.80    7.75

From this table, we obtain the values $n = 14$, $S/n = 7619$, $n \log (S/n) - \sum n_i
\log (S_i/n_i) = 2.303\,[54.35 - 50.85] = 8.06$. Also $\sum (1/n_i) = 4/3 + 1/2 = 11/6$,
so that $C = 1.15$ and $M = 8.06/1.15 = 7.02$. With 4 d.f., this value of $\chi^2$
corresponds to a $P$ of 0.13, so that even the rather large differences in the
estimates of variance, given by $S_i/n_i$ in the above table, are not really significant
in view of the small sample sizes.

Thompson and Merrington's tables should preferably be used for values of $n_i$
as small as those appearing in the above table. These give the 5% point for the
distribution of $-2 \log L$ as about 10.7 and the 1% point as about 14.9. The
observed value, 8.06, is therefore not significant at the 5% level.
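The arithmetic of Example 1 can be retraced with a short script. The sketch below is only an illustration: the function names are hypothetical, natural logarithms are used throughout as Eq. (9.1.9) requires, and the tail probability relies on the closed form $P(\chi^2 > x) = e^{-x/2}(1 + x/2)$, which holds for 4 d.f. only.

```python
import math

# Breaking strengths from Table 9.1, one list per sample.
samples = [
    [1690, 1580, 1745, 1685],
    [1550, 1445, 1645, 1545],
    [1625, 1450, 1510],
    [1725, 1550, 1430, 1445],
    [1530, 1545, 1565, 1520],
]


def sum_sq_dev(xs):
    """Sum of squared deviations from the sample mean, S_i of Eq. (9.1.4)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def bartlett_m(groups):
    """Return (uncorrected statistic, C, M) following Eqs. (9.1.9)-(9.1.10)."""
    k = len(groups)
    s = [sum_sq_dev(g) for g in groups]      # S_i
    n = [len(g) - 1 for g in groups]         # n_i = N_i - 1
    s_tot, n_tot = sum(s), sum(n)            # S and n = N - k
    raw = n_tot * math.log(s_tot / n_tot) - sum(
        ni * math.log(si / ni) for si, ni in zip(s, n)
    )
    c = 1 + (sum(1 / ni for ni in n) - 1 / n_tot) / (3 * (k - 1))
    return raw, c, raw / c


def chi2_upper_tail_4df(x):
    """P(chi-square with 4 d.f. > x); this closed form is valid for 4 d.f. only."""
    return math.exp(-x / 2) * (1 + x / 2)


raw, c, m = bartlett_m(samples)
p = chi2_upper_tail_4df(m)
print(f"uncorrected = {raw:.2f}, C = {c:.3f}, M = {m:.2f}, P = {p:.2f}")
```

The printed values should land close to the 8.06, 1.15, 7.02 and 0.13 quoted in the text; small differences in the second decimal place can arise because the text works from rounded common logarithms.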


9.2  A  Test  for  Difference  of  Means  in  k  Samples  The  simplest  application 
of  analysis  of  variance  occurs  in  the  problem  of  deciding  whether  a  group  of 
samples  come  from  populations  which  differ  from  one  another  in  respect  of 
their  mean  values  of  some  measured  variate  X.  It  is  assumed  that  they  do  not 
differ as regards the variance of X, and this homogeneity of variance may
be tested by the methods given in § 9.1. An example is the measurement of
breaking strength on the five samples of concrete cylinders in Table 9.1, in
which the test of homogeneity has been shown to be satisfied.

If $x_{ij}$ is the measured value of $X$ on the $j$th member of the $i$th sample, $j = 1,
2, \ldots, N_i$, $i = 1, 2, \ldots, k$, the mathematical model we assume is

(9.2.1)  $x_{ij} = \mu + \alpha_i + \varepsilon_{ij}$

where for every $i$ and $j$, $\varepsilon_{ij}$ is normally distributed with expectation 0 and
variance $\sigma^2$. This means that the measured $x_{ij}$ for any individual item is made
up of three parts which are added together, an over-all average value denoted
by $\mu$, an effect due to the particular treatment undergone by the $i$th sample,
denoted by $\alpha_i$ (and supposed to be the same for all members of this sample),
and a random or error term $\varepsilon_{ij}$ due to many unspecified causes. These $\varepsilon_{ij}$ are
supposed to be uncorrelated with each other.

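Adding the $x_{ij}$ for all items in all $k$ samples, we get

(9.2.2)  $\sum_{i,j} x_{ij} = N\mu + \sum_i N_i \alpha_i + \sum_{i,j} \varepsilon_{ij}$

where $N = \sum_i N_i$. We can suppose that $\mu$ and the $\alpha_i$ are so adjusted that
$\sum_i N_i \alpha_i = 0$. If this does not happen to be the case at first, and if $\sum_i N_i \alpha_i = h$,
we simply have to subtract $h/N$ from each $\alpha_i$ and add $h/N$ to $\mu$. Then if $\bar{x}$ is
the over-all mean of the $x_{ij}$ ($= N^{-1} \sum_{i,j} x_{ij}$), we see that $\bar{x}$ is an unbiased estimator
of $\mu$. The total sum of squares for all the $x_{ij}$ may be defined by

(9.2.3)  $S_t = \sum_{i,j} (x_{ij} - \bar{x})^2 = \sum_{i,j} x_{ij}^2 - G$

where $G = N\bar{x}^2 = (\sum_{i,j} x_{ij})^2/N$. Now

$x_{ij} - \bar{x} = (x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x})$

where $\bar{x}_i$ is the mean of $X$ for the $i$th sample, and therefore

$(x_{ij} - \bar{x})^2 = (x_{ij} - \bar{x}_i)^2 + (\bar{x}_i - \bar{x})^2 + 2(\bar{x}_i - \bar{x})(x_{ij} - \bar{x}_i)$.

The first expression for $S_t$ above can then be written

(9.2.4)  $S_t = \sum_{i,j} (x_{ij} - \bar{x}_i)^2 + \sum_i N_i (\bar{x}_i - \bar{x})^2 + 2 \sum_i \left[(\bar{x}_i - \bar{x}) \sum_j (x_{ij} - \bar{x}_i)\right]$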


The last term in this equation vanishes since $\sum_j (x_{ij} - \bar{x}_i) = 0$. Also
$\sum_j (x_{ij} - \bar{x}_i)^2$ is the sum of squares for the $x_{ij}$ belonging to the $i$th sample,
which we may denote by $S_i$, so that the first term on the right-hand side of
Eq. (4) is $\sum_i S_i$. This is generally called the sum of squares within samples,
denoted by $S_w$. The remaining term depends on the means of the various
samples and their relation to the over-all mean. It is called the sum of squares
between samples, denoted by $S_b$. Then

(9.2.5)  $S_t = S_w + S_b$

where

(9.2.6)  $S_w = \sum_{i,j} (x_{ij} - \bar{x}_i)^2 = \sum_{i,j} x_{ij}^2 - \sum_i N_i \bar{x}_i^2 = \sum_{i,j} x_{ij}^2 - \sum_i \frac{(\sum_j x_{ij})^2}{N_i}$

and

(9.2.7)  $S_b = \sum_i N_i (\bar{x}_i - \bar{x})^2 = \sum_i N_i \bar{x}_i^2 - N\bar{x}^2 = \sum_i \frac{(\sum_j x_{ij})^2}{N_i} - G$
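Because Eq. (9.2.5) is a purely algebraic identity, it can be checked on any numbers at all. The following is a minimal sketch with made-up, deliberately non-normal data and unequal sample sizes; the function name and the data are hypothetical and serve only as an illustration.

```python
def partition_sums_of_squares(groups):
    """Return (S_t, S_w, S_b) from the definitions in Eqs. (9.2.3), (9.2.6), (9.2.7)."""
    all_x = [x for g in groups for x in g]
    grand_mean = sum(all_x) / len(all_x)
    s_t = sum((x - grand_mean) ** 2 for x in all_x)
    s_w = 0.0
    s_b = 0.0
    for g in groups:
        mean_i = sum(g) / len(g)
        s_w += sum((x - mean_i) ** 2 for x in g)      # within-sample part
        s_b += len(g) * (mean_i - grand_mean) ** 2    # between-sample part
    return s_t, s_w, s_b


# Hypothetical data: skewed values, samples of sizes 4, 2 and 5.
groups = [[1, 1, 2, 10], [3, 4], [0, 0, 0, 7, 8]]
s_t, s_w, s_b = partition_sums_of_squares(groups)
print(s_t, s_w + s_b)  # the two numbers agree up to rounding error
```

The agreement does not depend on the data chosen, which is the point made in the text: normality and constancy of variance enter only when the parts are given a distribution.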

It should be noted that the splitting up of the total sum of squares into the
two parts $S_w$ and $S_b$ is a matter of algebra and does not depend on any assumptions
about the normality of the distribution or the constancy of variance
between samples. However, on the null hypothesis that all the $x_{ij}$ come from
a single normal population with variance $\sigma^2$ (this is equivalent to assuming
that all the $\alpha_i$ in Eq. (1) are separately zero), it follows that $S_t/\sigma^2$ is a $\chi^2$-variate
with $N - 1$ degrees of freedom. Similarly, within the $i$th sample, $S_i/\sigma^2$ is a
$\chi^2$-variate with $N_i - 1$ d.f., so that, by the addition theorem for independent
$\chi^2$-variates, $S_w/\sigma^2$ is distributed like $\chi^2$ with $\sum_i (N_i - 1) = N - k$ d.f. Theorem
4.3 then tells us that $S_b/\sigma^2$ is a $\chi^2$-variate with $N - 1 - (N - k) = k - 1$ d.f.,
and is independent of $S_w$.

Since the expectation of a $\chi^2$-variate is equal to the number of degrees of
freedom, it follows that

(9.2.8)  $E(S_t/\sigma^2) = N - 1, \qquad E(S_w/\sigma^2) = N - k, \qquad E(S_b/\sigma^2) = k - 1$

so that

(9.2.9)  $E[S_t/(N - 1)] = \sigma^2, \qquad E[S_w/(N - k)] = \sigma^2, \qquad E[S_b/(k - 1)] = \sigma^2$

Furthermore, the ratio of the two independent unbiased estimators of $\sigma^2$,
namely $S_b/(k - 1)$ and $S_w/(N - k)$, has the F distribution with $k - 1$ and $N - k$
degrees of freedom. These estimators are usually called "mean squares." The
result is set out in an Analysis of Variance table, such as Table 9.2.

Table 9.2

Variation          Sum of Squares    Degrees of Freedom    Mean Square
                   (S.S.)            (D.F.)                (M.S.)

Between samples    $S_b$             $k - 1$               $S_b/(k - 1)$
Within samples     $S_w$             $N - k$               $S_w/(N - k)$

Total              $S_t$             $N - 1$               $S_t/(N - 1)$

If the null hypothesis is not true, the $\alpha_i$ will not all be zero, and the mean
square between samples will tend to be greater than the mean square within
samples.* The F-test will therefore be a one-tailed test, and the probabilities
given in the table (Appendix B.5) are correct as stated there.

As an illustration we may consider the data of Example 1, § 9.1, in which
$k = 5$ and all the $N_i$ are 4 (except $N_3$, which is 3). We find $\sum_{i,j} x_{ij}^2$ = 46,842,150,
$\sum_{i,j} x_{ij}$ = 29,780, whence $G$ = 46,676,232, and $S_t$ = 165,918. Also the five
values of $(\sum_j x_{ij})^2$ are $(6700)^2$, $(6185)^2$, $(4585)^2$, $(6150)^2$ and $(6160)^2$, whence
$S_w$ = 106,661. The value of $S_b$, by difference, is 59,257. The results of the
analysis, set out in the form of Table 9.2, are as follows:

Table 9.3

Variation          S.S.       D.F.    M.S.

Between samples    59,257      4      14,814
Within samples     106,661    14      7,619

Total              165,918    18      9,218

The value of $F$ is (14814)/(7619) = 1.94, with 4 and 14 d.f. Since the 5% point
is 3.11 and the 1% point 5.03, it is clear that the observed value is not significant.
We can therefore, as far as this test is concerned, accept the null hypothesis that
the strength of the concrete was not affected by the different amounts of coal in
the sand used in making it.
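The entries of Table 9.3 can be recomputed from the raw data of Table 9.1 using the working formulas (9.2.3) and (9.2.7). The script below is a sketch of that calculation only; the variable names are illustrative, and the critical values 3.11 and 5.03 are simply those quoted in the text, not computed here.

```python
# Breaking strengths from Table 9.1, one list per sample.
samples = [
    [1690, 1580, 1745, 1685],
    [1550, 1445, 1645, 1545],
    [1625, 1450, 1510],
    [1725, 1550, 1430, 1445],
    [1530, 1545, 1565, 1520],
]

all_x = [x for g in samples for x in g]
N = len(all_x)                       # total number of items, 19
k = len(samples)                     # number of samples, 5

G = sum(all_x) ** 2 / N              # correction term, G = (sum x)^2 / N
S_t = sum(x * x for x in all_x) - G  # total S.S., Eq. (9.2.3)
S_b = sum(sum(g) ** 2 / len(g) for g in samples) - G  # between samples, Eq. (9.2.7)
S_w = S_t - S_b                      # within samples, by difference

ms_between = S_b / (k - 1)
ms_within = S_w / (N - k)
F = ms_between / ms_within

print(f"S_b = {S_b:.0f} on {k - 1} d.f., mean square {ms_between:.0f}")
print(f"S_w = {S_w:.0f} on {N - k} d.f., mean square {ms_within:.0f}")
print(f"S_t = {S_t:.0f} on {N - 1} d.f.")
print(f"F = {F:.2f}")
```

Here $S_w$ is obtained by difference, whereas the text obtains $S_b$ by difference; either order gives the same table apart from rounding in the last digit.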

When the $k$ samples are all of the same size, say $r$, $N = rk$ and $\sum_i \alpha_i = 0$.
Eq. (7) becomes

(9.2.10)  $S_b = \frac{1}{r} \sum_i \left(\sum_j x_{ij}\right)^2 - G$

9.3 Two-Way Classification (Complete Blocks) In a somewhat more complicated
experimental design, the attempt is made to estimate two effects simultaneously.
In the example above of the concrete cylinders, we might have
allowed the concrete to set for different periods of time before testing its strength.
In a field experiment on the yield of a certain crop, under different treatments,
we might wish to estimate the effect of different locations of the experimental
plots. (There might for instance be appreciable effects due to differences of soil
moisture, drainage, slope, shade, etc., as well as natural differences of soil
fertility.) The experimental procedure is then to set out the plots in distinct
blocks, the same number of plots in each block, arranged so that as far as
possible the plots in any one block are relatively similar. The experimental
treatments are applied randomly to the plots within each block. Figure 44
suggests a possible arrangement in which five treatments are used in each block,
and each treatment is replicated on two plots. This is an illustration of a
"complete block design." The purpose of randomization is to reduce as far as
possible any systematic effects of the uncontrolled factors in the experiment,
and to give increased justification for applying statistical theory.

[Fig. 44 Complete randomized blocks: Block 1, Block 2 and Block 3, each of ten
plots carrying the treatments T1 to T5 in random order]
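A layout of the kind described above can be generated mechanically. The sketch below assumes three blocks, five treatments and two plots per treatment in each block; the function name, the treatment labels and the seed are hypothetical choices made only so that the illustration is reproducible.

```python
import random


def randomized_blocks(treatments, n_blocks, replicates, seed=None):
    """Assign every treatment `replicates` times, in random order, within each block."""
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        # Each block receives the full set of treatments, repeated as required.
        plots = [t for t in treatments for _rep in range(replicates)]
        rng.shuffle(plots)           # randomization is done separately per block
        layout.append(plots)
    return layout


layout = randomized_blocks(["T1", "T2", "T3", "T4", "T5"],
                           n_blocks=3, replicates=2, seed=44)
for number, plots in enumerate(layout, start=1):
    print(f"Block {number}:", " ".join(plots))
```

Shuffling inside each block, rather than over the whole field, keeps the design complete: every block still contains every treatment the same number of times.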

In general we will suppose that we have $a$ treatments and $b$ blocks, and that
each treatment in each block is replicated $r$ times. The total number of individual
items (plots) in the experiment is $N = abr$. A variate $X$ is measured on
each item, and we will denote by $x_{ijk}$ the value of $X$ for the $i$th treatment in the
$j$th block, on the $k$th replicate. We suppose that the $i$th treatment has an effect
on $X$ measured by $\alpha_i$, and that the blocks also have their effect, the $j$th block
contributing $\beta_j$ to $X$. There may also be a differential effect of treatments in
different blocks, known as "interaction." (The $i$th treatment may not contribute
the same amount to $X$ in each block.) Assuming that these various effects can
be added together, we have as our mathematical model:

(9.3.1)  $x_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$


where $\mu$ is the over-all average effect, $\gamma_{ij}$ is the interaction (the extra contribution
of the ith treatment in the jth block over and above the general effect of this
treatment), and $\varepsilon_{ijk}$ is the random effect shown by the kth replicate in the jth
block under the ith treatment. This random part of $x_{ijk}$ is due to all the miscellaneous
causes which may produce an effect but which are not specifically
allowed for in the design of the experiment. It is usually considered as experimental
error.

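We can always adjust the origins from which $\alpha_i$, $\beta_j$ and $\gamma_{ij}$ are measured,
so as to satisfy the conditions:

(9.3.2)  $\sum_i \alpha_i = 0$,   $\sum_j \beta_j = 0$,   $\sum_i \gamma_{ij} = 0$  (j = 1, 2 ... b),   $\sum_j \gamma_{ij} = 0$  (i = 1, 2 ... a)

We may suppose also that $E(\varepsilon_{ijk}) = 0$ and $V(\varepsilon_{ijk}) = \sigma^2$.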

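The total sum of squares may be split into four constituent parts, namely,

(9.3.3)  $S_t = S_a + S_b + S_{ab} + S_r$

where

(9.3.4)  $S_t = \sum_{ijk} x_{ijk}^2 - G$

(9.3.5)  $G = N\bar{x}^2 = N^{-1} \left( \sum_{ijk} x_{ijk} \right)^2$

(9.3.6)  $S_a = (br)^{-1} \sum_i \left( \sum_{jk} x_{ijk} \right)^2 - G$

(9.3.7)  $S_b = (ar)^{-1} \sum_j \left( \sum_{ik} x_{ijk} \right)^2 - G$

(9.3.8)  $S_{ab} = r^{-1} \sum_{ij} \left( \sum_k x_{ijk} \right)^2 - S_a - S_b - G$

(9.3.9)  $S_r = \sum_{ijk} x_{ijk}^2 - r^{-1} \sum_{ij} \left( \sum_k x_{ijk} \right)^2$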

The four terms on the right-hand side of Eq. (3) are, respectively, the sum of
squares (S.S.) between treatments, the S.S. between blocks, the S.S. due to
interaction and the S.S. between replicates. The degrees of freedom are a - 1,
b - 1, (a - 1)(b - 1) and ab(r - 1).
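The computing formulas (9.3.4) to (9.3.9) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `two_way_sums_of_squares`, the nested-list layout `x[i][j][k]`, and the small data set are assumptions made for the illustration and do not come from the text.

```python
def two_way_sums_of_squares(x):
    """x[i][j][k]: treatment i, block j, replicate k (balanced layout assumed)."""
    a, b, r = len(x), len(x[0]), len(x[0][0])
    n = a * b * r
    values = [v for plane in x for cell in plane for v in cell]
    g = sum(values) ** 2 / n                    # correction term G, Eq. (9.3.5)
    raw = sum(v * v for v in values)            # sum of squared observations
    cells = sum(sum(cell) ** 2 for plane in x for cell in plane) / r
    s_a = sum(sum(sum(cell) for cell in plane) ** 2 for plane in x) / (b * r) - g
    s_b = sum(sum(sum(x[i][j]) for i in range(a)) ** 2 for j in range(b)) / (a * r) - g
    ss = {
        "treatments": s_a,                      # Eq. (9.3.6)
        "blocks": s_b,                          # Eq. (9.3.7)
        "interaction": cells - s_a - s_b - g,   # Eq. (9.3.8)
        "replicates": raw - cells,              # Eq. (9.3.9)
        "total": raw - g,                       # Eq. (9.3.4)
    }
    df = {
        "treatments": a - 1,
        "blocks": b - 1,
        "interaction": (a - 1) * (b - 1),
        "replicates": a * b * (r - 1),
    }
    return ss, df


# Hypothetical data: 2 treatments, 2 blocks, 2 replicates.
data = [[[1, 2], [3, 4]], [[5, 6], [7, 9]]]
ss, df = two_way_sums_of_squares(data)
print(ss, df)
```

The four component sums of squares add up to the total, which gives a convenient arithmetic check on any hand calculation.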

If $\bar{x}_{i..}$ is the mean of $x_{ijk}$ taken over all blocks and replicates for the ith
treatment, and if $\bar{x}$ is the mean of all the observed $x_{ijk}$, then

(9.3.10)  $S_a = \sum_{ijk} (\bar{x}_{i..} - \bar{x})^2$

and this is algebraically equivalent to Eq. (6). Similarly,

(9.3.11)  $S_b = \sum_{ijk} (\bar{x}_{.j.} - \bar{x})^2$

where $\bar{x}_{.j.}$ is the mean for the jth block, over all treatments and replicates;

(9.3.12)  $S_r = \sum_{ijk} (x_{ijk} - \bar{x}_{ij.})^2$

where $\bar{x}_{ij.}$ is the mean over the r replicates for the ith treatment in the jth block;
and

(9.3.13)  $S_{ab} = \sum_{ijk} (\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x})^2$

The mean squares may be calculated as before. On the null hypothesis that all
the $\alpha_i$, $\beta_j$ and $\gamma_{ij}$ are zero, and on the assumption that the $\varepsilon_{ijk}$ are normally
distributed, these mean squares are all unbiased estimators of $\sigma^2$. Moreover,
the mean squares for treatments, blocks and interactions are independent of
the mean square for replicates, so that the ratios

$\frac{S_a}{S_r} \cdot \frac{ab(r - 1)}{a - 1}$,   $\frac{S_b}{S_r} \cdot \frac{ab(r - 1)}{b - 1}$   and   $\frac{S_{ab}}{S_r} \cdot \frac{ab(r - 1)}{(a - 1)(b - 1)}$

all have the F distribution with the appropriate degrees of freedom. These
ratios can therefore be used to test whether there are significant treatment
effects, block effects, or interaction effects.

Unless r is greater than 1 there is no possibility, with this design, of testing
for interaction by the ordinary F-test. If r = 1, the interaction effect is generally
ignored or treated as part of the error. If it is assumed, however, that $\gamma_{ij}$ is of
the form $C\alpha_i\beta_j$, where C is constant, an F-test of the hypothesis $\gamma_{ij} = 0$ is
possible. See [7], p. 130.

Example 2  Tests were carried out on sheets of building material for permeability
[4]. Specimens were selected from the output of each of three machines
on  each  of  nine  days,  and  for  each  machine  on  each  day  three  sheets  were 
examined.  The  raw  materials  all  came  from  a  common  store,  but  it  was  thought 
that  the  machines  might  vary  in  their  quality  of  output  and  might  also  vary 
from  day  to  day.  The  machines  may  be  regarded  as  "treatments"  and  the  days 
as  "blocks,"  and  there  were  three  replicates.  The  randomizing  within  blocks 
was  done  by  varying  the  order  of  sampling  from  the  machines  on  the  different 
days. 

Table 9.4.

Variation             S.S.      D.F.    M.S.
Between machines      0.9168      2     0.4584
Between days          0.5534      8     0.0692
Interaction           0.8657     16     0.0541
Between replicates    2.0150     54     0.0373
Total                 4.3509     80

In  this  experiment  the  measured  variate  was  the  permeability  (an  average 
of  eight  measurements  on  each  sheet).  Since  it  appeared  that  the  logarithm 
of  the  variate  was  more  nearly  normally  distributed  than  the  variate  itself, 
the values in the above table all relate to the common log of the permeability.
This is denoted by $x_{ijk}$. From the experimental data, we find

N = 3 · 9 · 3 = 81

$\sum_{ijk} x_{ijk}$ = 127.093,   G = 199.4152

$\sum_{ijk} x_{ijk}^2$ = 203.7661,   $S_t$ = 4.3509

$\sum_i \left( \sum_{jk} x_{ijk} \right)^2$ = 5408.9627,   $S_a$ = 0.9168

$\sum_j \left( \sum_{ik} x_{ijk} \right)^2$ = 1799.7177,   $S_b$ = 0.5534

$\sum_{ij} \left( \sum_k x_{ijk} \right)^2$ = 605.2533,   $S_r$ = 2.0150

and therefore $S_{ab}$ = 0.8657. The analysis of variance is given in Table 9.4.
The F-ratio for interaction is 1.45, with 16 and 54 d.f. Since the 5% point is 1.83,
the hypothesis of zero interaction is not rejected.

The F-ratio for days is 1.85, with 8 and 54 d.f. This is also non-significant
at the 5% level.

The F-ratio for machines is 12.3, with 2 and 54 d.f. The 1% point is 5.02,
so that there is a highly significant effect of the differences between machines.
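As a check on the arithmetic, the mean squares and F-ratios of Example 2 can be recomputed from the sums of squares and degrees of freedom in Table 9.4. The short Python sketch below does only this; the dictionary names are illustrative, and the critical values quoted in the text are not recomputed.

```python
# Sums of squares and degrees of freedom copied from Table 9.4.
ss = {"machines": 0.9168, "days": 0.5534, "interaction": 0.8657, "replicates": 2.0150}
df = {"machines": 2, "days": 8, "interaction": 16, "replicates": 54}

# Mean square = sum of squares / degrees of freedom.
ms = {source: ss[source] / df[source] for source in ss}

# Each effect is tested against the mean square between replicates.
f_ratio = {
    source: ms[source] / ms["replicates"]
    for source in ("machines", "days", "interaction")
}

for source, f in f_ratio.items():
    print(f"{source}: F = {f:.2f} with {df[source]} and {df['replicates']} d.f.")
```

The ratios agree with the values 12.3, 1.85 and 1.45 quoted above.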

If  there  are  no  interactions,  the  conclusions  about  the  main  effects  are  much 
simplified.  If  there  seems  to  be  a  real  difference  between  two  treatments,  for 
example,  we  can  conclude  that  this  difference  persists  in  all  blocks.  But  if  there 
is  appreciable  interaction,  a  significant  difference  between  the  two  treatments 
merely  means  that,  on  the  average  over  all  blocks,  there  is  a  difference.  In  some 
particular  block  this  difference  might  not  exist  or  might  even  be  reversed  in  sign. 

It  may  happen  that  the  hypothesis  of  no  interactions  will  be  rejected  by  the 
ordinary  statistical  test  while  at  the  same  time  the  hypothesis  of  zero  main 
effects  will  be  accepted.  This  means  that  there  certainly  are  non-zero  differences 
between  blocks  or  treatments,  but  that  when  the  block  differences  are  averaged 
over  the  treatments,  or  the  treatment  differences  over  the  blocks,  the  averages 
are  not  significantly  different  from  zero. 

9.4  Estimation  of  Fixed  Treatment  Effects  (Model  I)  There  are  two  ways 
of  looking  at  the  treatment  effects.  They  may  be  regarded  as  fixed  effects  or  as 
random  variables,  and  in  different  situations  either  the  one  way  or  the  other 
may  be  more  appropriate.  In  Example  1  of  §  9.1,  the  "treatments"  were  fixed 
percentages  of  coal  in  the  sand  used  for  making  concrete,  and  any  conclusions 
drawn  from  the  experiment  would  presumably  refer  to  these  percentages  and 
these  only.  However,  it  is  conceivable  that  the  specimens  of  sand  used  might 
have  contained  variable  amounts  of  coal,  drawn  at  random  from  some  parent 
distribution,  and  in  this  case  we  could  estimate  the  variance  of  the  effect  of 
added  coal  and  apply  the  results  of  the  experiment  to  percentages  of  coal  outside 
the values actually observed.

In Example 2, above, the machine effects should probably be regarded as
fixed, and the results applied to the particular machines used. The days,
however, might well be considered a random sample of days, unless there was a
special  reason  for  selecting  particular  days  for  the  experiment. 

The mathematical model with fixed effects, which is the one we have been
using, is sometimes called Model I. For the one-way classification of § 9.1,
we have

(9.4.1)  $x_{ij} = \mu + \alpha_i + \varepsilon_{ij}$,   i = 1, 2 ... k,   j = 1, 2 ... $N_i$

and the mean of the ith sample is

(9.4.2)  $\bar{x}_{i.} = \mu + \alpha_i + \bar{\varepsilon}_{i.}$

The over-all weighted mean of the $\bar{x}_{i.}$, with weights $N_i$, is

(9.4.3)  $\bar{x} = \mu + \bar{\varepsilon}$

since $\sum_i N_i \alpha_i = 0$.

The expectation of $\bar{x}_{i.}$ is, therefore, $\mu + \alpha_i$, and its variance is $\sigma^2/N_i$. The
sum of squares between samples is given by

(9.4.4)  $S_b = \sum_i N_i (\bar{x}_{i.} - \bar{x})^2$

         $= \sum_i N_i [\bar{x}_{i.} - \mu - \alpha_i - (\bar{x} - \mu)]^2 + \sum_i N_i \alpha_i^2$

         $\quad + 2 \sum_i N_i \alpha_i [\bar{x}_{i.} - \mu - \alpha_i - (\bar{x} - \mu)]$

Now $\bar{x}_{i.} - \mu - \alpha_i$ is normal with expectation 0 and variance $\sigma^2/N_i$, and its
weighted mean is $\bar{x} - \mu$. It therefore follows that $\sum_i (N_i/\sigma^2)[\bar{x}_{i.} - \mu - \alpha_i - (\bar{x} - \mu)]^2$
has the $\chi^2$ distribution with k - 1 d.f., and hence

(9.4.5)  $E \sum_i N_i [\bar{x}_{i.} - \mu - \alpha_i - (\bar{x} - \mu)]^2 = (k - 1)\sigma^2$

Also $E[\bar{x}_{i.} - \mu - \alpha_i - (\bar{x} - \mu)] = 0$, so that

(9.4.6)  $E(S_b) = (k - 1)\sigma^2 + \sum_i N_i \alpha_i^2$

This shows that if the $\alpha_i$ are not all zero, the expectation of $S_b/(k - 1)$ is
greater than $\sigma^2$, which justifies the use of the one-tailed F-test for treatment
effects between samples. In the same way,

(9.4.7)  $E(S_t) = (N - 1)\sigma^2 + \sum_i N_i \alpha_i^2$

and therefore,

(9.4.8)  $E(S_w) = (N - k)\sigma^2$

The fixed treatment effects do not enter into the mean square within samples.
From Eq. (3),

(9.4.9)  $E(\bar{x}) = \mu$

so that $\bar{x}$ is an unbiased estimator of $\mu$. Also, from Eq. (1),

(9.4.10)  $E\left( \sum_j x_{ij} \right) = N_i \mu + N_i \alpha_i$

so that $\alpha_i = E(\bar{x}_{i.}) - \mu$. The quantity $\bar{x}_{i.} - \bar{x}$ is therefore an unbiased
estimator of $\alpha_i$. In Example 1, the estimates of $\alpha_i$ are $\hat{\alpha}_1$ = 108, $\hat{\alpha}_2$ =
-21, $\hat{\alpha}_3$ = -39, $\hat{\alpha}_4$ = -30, $\hat{\alpha}_5$ = -27.

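For the two-way classification of § 9.3, with fixed effects, a similar argument
leads to the following results:

(9.4.11)  $E(S_a) = (a - 1)\sigma^2 + br \sum_i \alpha_i^2$

          $E(S_b) = (b - 1)\sigma^2 + ar \sum_j \beta_j^2$

          $E(S_{ab}) = (a - 1)(b - 1)\sigma^2 + r \sum_{ij} \gamma_{ij}^2$

          $E(S_r) = ab(r - 1)\sigma^2$

          $E(S_t) = (abr - 1)\sigma^2 + br \sum_i \alpha_i^2 + ar \sum_j \beta_j^2 + r \sum_{ij} \gamma_{ij}^2$

The estimators of $\alpha_i$, $\beta_j$, $\gamma_{ij}$ and $\mu$ are

(9.4.12)  $\hat{\alpha}_i = \bar{x}_{i..} - \bar{x}$,   i = 1, 2 ... a

          $\hat{\beta}_j = \bar{x}_{.j.} - \bar{x}$,   j = 1, 2 ... b

          $\hat{\gamma}_{ij} = \bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x}$

          $\hat{\mu} = \bar{x}$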
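The estimators of Eq. (9.4.12) are simple differences of means, as the following Python sketch illustrates. The function name `fixed_effect_estimates`, the nested-list layout `x[i][j][k]` and the small data set are assumptions made for the illustration; a balanced layout is assumed throughout.

```python
def fixed_effect_estimates(x):
    """Estimates of mu, alpha_i, beta_j and gamma_ij for balanced data x[i][j][k]."""
    a, b, r = len(x), len(x[0]), len(x[0][0])
    cell = [[sum(x[i][j]) / r for j in range(b)] for i in range(a)]   # cell means
    row = [sum(cell[i]) / b for i in range(a)]                        # treatment means
    col = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]   # block means
    mu = sum(row) / a                                                 # over-all mean
    alpha = [m - mu for m in row]
    beta = [m - mu for m in col]
    gamma = [[cell[i][j] - row[i] - col[j] + mu for j in range(b)] for i in range(a)]
    return mu, alpha, beta, gamma


# Hypothetical data: 2 treatments, 2 blocks, 2 replicates.
data = [[[1, 2], [3, 4]], [[5, 6], [7, 9]]]
mu, alpha, beta, gamma = fixed_effect_estimates(data)
print(mu, alpha, beta, gamma)
```

The estimated effects automatically satisfy the side conditions (9.3.2): the estimates of $\alpha_i$ and $\beta_j$ each sum to zero, and the estimated interactions sum to zero over either subscript.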


9.5  Estimation of Variable Treatment Effects (Components of Variance)
Model II  In Model II, the effects (even including the interaction) are treated
as random variates which are normally and independently distributed. Thus in
Eq. (9.2.1), $\alpha_i$ is regarded as a value of a random variate which has expectation
zero and variance $\sigma_\alpha^2$. We must suppose that the k samples actually examined
are a random selection from a large population of possible samples. The
members of this population may be denoted by the subscript u, and for each
there is a "true" or expected value of X which we may denote by $m_u$. This
quantity $m_u$ is a random variable with a certain probability distribution over
the population, and its expected value is $\mu$. The difference between $m_u$ and $\mu$
is the true effect of sample u, which we have denoted by $\alpha_u$. The variance of $\alpha_u$
is the quantity $\sigma_\alpha^2$. We have, therefore, for an actually selected sample i,


(9.5.1)  $x_{ij} = m_i + \varepsilon_{ij} = \mu + \alpha_i + \varepsilon_{ij}$

where $\varepsilon_{ij}$ is the error term, namely, the difference between the true value $m_i$ and
the measured value on the jth replicate. It is assumed that the set of $\alpha_i$ and the
set of $\varepsilon_{ij}$ are completely independent and that the $\varepsilon_{ij}$ have the same variance $\sigma^2$
for all i, and therefore,

(9.5.2)  $V(x_{ij}) = \sigma_\alpha^2 + \sigma^2$

The quantities $\sigma_\alpha^2$ and $\sigma^2$ are the components of variance.

It should be noted that two observations in the same sample will be correlated.
Thus the covariance of $x_{ij}$ and $x_{ij'}$ is given by

$C(x_{ij}, x_{ij'}) = E[(x_{ij} - \mu)(x_{ij'} - \mu)]$

$= E[(\alpha_i + \varepsilon_{ij})(\alpha_i + \varepsilon_{ij'})]$

$= E(\alpha_i^2) = \sigma_\alpha^2$

since all the other terms have zero expectations by hypothesis. The quantity
$\sigma_\alpha^2/(\sigma_\alpha^2 + \sigma^2)$, which is the correlation coefficient between two observations in
the same sample, is called the intra-class correlation coefficient.

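The usual null hypothesis to be tested is that $\sigma_\alpha^2 = 0$, which implies that
$m_u = \mu$ for all values of u. If we suppose that all the samples are the same
size (r), the sums of squares between samples and within samples are, as before,

(9.5.3)  $S_b = r \sum_i (\bar{x}_{i.} - \bar{x})^2$

         $= r \sum_i [(\mu + \alpha_i + \bar{\varepsilon}_{i.}) - (\mu + \bar{\alpha} + \bar{\varepsilon})]^2$

and

(9.5.4)  $S_w = \sum_{ij} (x_{ij} - \bar{x}_{i.})^2$

         $= \sum_{ij} (\varepsilon_{ij} - \bar{\varepsilon}_{i.})^2$

If the $\varepsilon_{ij}$ are normal with expectation 0 and variance $\sigma^2$, $S_w/\sigma^2$ has the $\chi^2$
distribution with k(r - 1) d.f., so that

(9.5.5)  $E\left[ \frac{S_w}{k(r - 1)} \right] = \sigma^2$

If also the $\alpha_i$ are normal with variance $\sigma_\alpha^2$, and if we write $\eta_i$ for the variable
$\alpha_i + \bar{\varepsilon}_{i.}$, then

(9.5.6)  $S_b = r \sum_i (\eta_i - \bar{\eta})^2$

and the $\eta_i$ are independently normal with expectation 0 and variance $\sigma_\alpha^2 + \sigma^2/r$.
It follows that $S_b/[(\sigma_\alpha^2 + \sigma^2/r)r]$ is a chi-square variate with k - 1 d.f., so that

(9.5.7)  $E\left[ \frac{S_b}{k - 1} \right] = r\sigma_\alpha^2 + \sigma^2$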

The sums of squares $S_b$ and $S_w$ are statistically independent. The null hypothesis
is therefore tested by the ordinary F-test of the ratio of mean squares between
and within samples.
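Equating the two observed mean squares to their expectations in Eqs. (9.5.5) and (9.5.7) gives simple moment estimates of the components of variance. The Python sketch below illustrates this for equal sample sizes; the function name `variance_components` and the data are hypothetical, and the estimate of $\sigma_\alpha^2$ obtained this way is not guaranteed to be non-negative.

```python
def variance_components(samples):
    """One-way layout with k samples of equal size r (Model II assumed)."""
    k, r = len(samples), len(samples[0])
    means = [sum(s) / r for s in samples]
    grand = sum(means) / k
    # Mean squares between and within samples.
    ms_b = r * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_w = sum((v - m) ** 2 for s, m in zip(samples, means) for v in s) / (k * (r - 1))
    var_error = ms_w                    # estimates sigma^2, from Eq. (9.5.5)
    var_alpha = (ms_b - ms_w) / r       # estimates sigma_alpha^2, from Eq. (9.5.7); may be negative
    intraclass = var_alpha / (var_alpha + var_error)
    return ms_b, ms_w, var_alpha, intraclass


# Hypothetical data: k = 3 samples of r = 3 observations each.
samples = [[1, 2, 3], [4, 5, 6], [8, 9, 10]]
ms_b, ms_w, var_alpha, intraclass = variance_components(samples)
print(ms_b / ms_w, var_alpha, intraclass)   # F-ratio, component estimate, intra-class correlation
```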

The power of this test is a function of $\sigma_\alpha^2/\sigma^2$. If F has the ordinary F-distribution
with k - 1 and k(r - 1) d.f., and if $F_\alpha$ is the value of F exceeded with
probability $\alpha$, then the power is given by

(9.5.8)  $P = \Pr\left( F \ge F_\alpha \frac{\sigma^2}{\sigma^2 + r\sigma_\alpha^2} \right)$

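For the two-way layout of § 9.3, a similar set of assumptions leads to the
model

(9.5.9)  $x_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$

where the $\alpha_i$, the $\beta_j$, the $\gamma_{ij}$ and the $\varepsilon_{ijk}$ are independently and normally distributed
with zero expectations and variances $\sigma_\alpha^2$, $\sigma_\beta^2$, $\sigma_\gamma^2$ and $\sigma^2$, respectively.
The variance of $x_{ijk}$ is then given by

(9.5.10)  $V(x_{ijk}) = \sigma_\alpha^2 + \sigma_\beta^2 + \sigma_\gamma^2 + \sigma^2$

The sum of squares for A-effects is

(9.5.11)  $S_a = br \sum_i (\bar{x}_{i..} - \bar{x})^2$

          $= br \sum_i (\alpha_i - \bar{\alpha} + \bar{\gamma}_{i.} - \bar{\gamma} + \bar{\varepsilon}_{i..} - \bar{\varepsilon})^2$

If $\eta_i = \alpha_i + \bar{\gamma}_{i.} + \bar{\varepsilon}_{i..}$, then $\eta_i$ is normal with expectation 0 and variance
$\sigma_\alpha^2 + \sigma_\gamma^2/b + \sigma^2/(br)$. Also its mean over the a values of i is $\bar{\eta} = \bar{\alpha} + \bar{\gamma} + \bar{\varepsilon}$.

Therefore, $\sum_i (\eta_i - \bar{\eta})^2 / [\sigma_\alpha^2 + \sigma_\gamma^2/b + \sigma^2/(br)]$ is $\chi^2$ with a - 1 d.f., and

(9.5.12)  $E[S_a/(a - 1)] = br[\sigma_\alpha^2 + \sigma_\gamma^2/b + \sigma^2/(br)]$

          $= br\sigma_\alpha^2 + r\sigma_\gamma^2 + \sigma^2$.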

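A similar argument leads to the results:

(9.5.13)  $E\left[ \frac{S_b}{b - 1} \right] = ar\sigma_\beta^2 + r\sigma_\gamma^2 + \sigma^2$

(9.5.14)  $E\left[ \frac{S_{ab}}{(a - 1)(b - 1)} \right] = r\sigma_\gamma^2 + \sigma^2$

(9.5.15)  $E\left[ \frac{S_r}{ab(r - 1)} \right] = \sigma^2$

The four sums of squares are statistically independent, and therefore ordinary
F-tests of the hypotheses $\sigma_\alpha^2 = 0$, $\sigma_\beta^2 = 0$, $\sigma_\gamma^2 = 0$ may be carried out.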

It should be noted that, contrary to the conclusions from the fixed-effects
model, the interaction component of variance appears in the mean squares for
A-effects and B-effects. The F-test for the null hypothesis $\sigma_\alpha^2 = 0$ must therefore
be carried out by comparing the mean squares for A-effects and interaction, and
similarly for $\sigma_\beta^2 = 0$. If r > 1, we can test for interaction by an F-test of
Eq. (14) against Eq. (15).

Similar  but  more  complicated  models  can  be  used  when  there  are  three  main 
effects,  with  three  second-order  interactions  and  one  third-order  interaction. 
With  Model  II  a  complication  arises  when  the  interactions  are  not  negligible, 
because  it  turns  out  to  be  impossible  to  apply  the  F-test  directly  in  order  to  test 
for the main effects. An approximate method suggested by Satterthwaite may
be tried in such cases; see [5], [6].

*  9.6  Mixed  Models  (Model  III)  A  layout  in  which  it  seems  reasonable  to 
regard  one  effect  as  fixed  and  another  as  random  is  said  to  be  mixed.  In  a 
problem  concerned  with  the  daily  output  of  workers  in  a  factory  using  certain 
machines  [7],  we  might  be  inclined  to  regard  the  workers  as  a  random  sample 
from  a  large  population,  but  we  might  be  interested  in  the  performance  of 
individual  machines,  perhaps  of  different  makes. 

Let $x_{ijk}$ be the output, say, for the jth worker on the kth day that he is assigned
to the ith machine (i = 1, 2 ... a, j = 1, 2 ... b, k = 1, 2 ... r). The days will
be regarded merely as replicates, the effects in which we are interested being
the fixed effects $\alpha_i$ of machines and the random effects $\beta_j$ of workers as well
as their possible interactions. We assume that

(9.6.1)  $x_{ijk} = m_{ij} + \varepsilon_{ijk}$

where $m_{ij}$ is the "true" mean output of the jth worker on the ith machine, and
the $\varepsilon_{ijk}$ are independent normal variates with mean zero and variance $\sigma^2$. Since
the jth worker is regarded as a random selection from a large population of
workers, we can think of $m_{ij}$ as a particular value of a random variable $M_i$
which represents the mean output of a worker selected at random on the
machine numbered i.

Let the expected value of $M_i$ over the population of workers be denoted
by $\mu_i$ and let the arithmetic mean of the $\mu_i$ over the a machines be denoted by $\mu$.
Then

(9.6.2)  $\mu_i = E(M_i)$,   $\mu = \bar{\mu}_. = \sum_i \mu_i / a$

The main effect of the ith machine is

(9.6.3)  $\alpha_i = \mu_i - \mu$,   $\sum_i \alpha_i = 0$

Suppose $m_{iw}$ is the value of $M_i$ for any worker labelled w in the whole
population of workers. The true mean for this worker is the average of $m_{iw}$
over the a machines and the main effect $\beta_w$ of worker w is the excess of this
over the general mean.

(9.6.4)  $\beta_w = m_{.w} - E(M_.)$

This is a random variable with expectation 0, and variance, say, $\sigma_\beta^2$.

The main effect of worker w, specific to the ith machine, may be defined as
$m_{iw} - E(M_i) = m_{iw} - \mu_i$, and the excess of this above its average over the
machines is called the interaction of the ith machine and the particular worker w,
namely,

(9.6.5)  $\gamma_{iw} = m_{iw} - \mu_i - m_{.w} + \mu$

For each i, this is a random variable with expectation zero and variance $\sigma_{\gamma i}^2$.
Also, $\sum_i \gamma_{iw} = 0$ for each value of w.
From Eqs. (3), (4) and (5) we obtain

(9.6.6)  $m_{iw} = \mu + \alpha_i + \beta_w + \gamma_{iw}$

and, for any worker w and any day d,

(9.6.7)  $x_{iwd} = \mu + \alpha_i + \beta_w + \gamma_{iw} + \varepsilon_{iwd}$

where the $\varepsilon_{iwd}$ are independent of each other and of $\beta_w$ and $\gamma_{iw}$, and have a
common variance $\sigma^2$. The $\gamma_{iw}$, for different values of i, are not necessarily
independent of each other or of $\beta_w$, but have covariances depending on those
of the random variables $M_i$.

The b workers actually used in the experiment may be regarded as a random
sample from the whole population of workers, and the r days similarly form a
sample of all possible days. The $x_{ijk}$ of Eq. (1) has therefore the same form as
the $x_{iwd}$ of Eq. (7), but j takes only the values 1, 2 ... b, and k takes only the
values 1, 2 ... r. That is,

(9.6.8)  $x_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$

The $\beta_j$ for different values of j may be looked on as independent variates all
having the same distribution as $\beta_w$, and the $\gamma_{ij}$ are similarly independent with
the same distribution as $\gamma_{iw}$, for any i. The $\varepsilon_{ijk}$ can be regarded as independent
of each other and of the $\beta_j$ and the $\gamma_{ij}$.

For convenience of notation we may define $\sigma_\alpha^2$ by the relation

(9.6.9)  $(a - 1)\sigma_\alpha^2 = \sum_i \alpha_i^2$

but it must be remembered that we are treating the $\alpha_i$ as fixed effects, so that
$\sigma_\alpha^2$ is not the variance of a random variable. Also we will define $\sigma_\gamma^2$ by

(9.6.10)  $(a - 1)\sigma_\gamma^2 = \sum_i \sigma_{\gamma i}^2$

The division of the total sum of squares into four parts may be carried out
just as in Model I or Model II. We get

(9.6.11)  $S_t = S_a + S_b + S_{ab} + S_r$

where

(9.6.12)  $S_a = br \sum_i (\bar{x}_{i..} - \bar{x})^2$

          $S_b = ar \sum_j (\bar{x}_{.j.} - \bar{x})^2$

          $S_{ab} = r \sum_{ij} (\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x})^2$

          $S_r = \sum_{ijk} (x_{ijk} - \bar{x}_{ij.})^2$

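It is straightforward to show, as in § 9.4, that if the $\varepsilon_{ijk}$, $\beta_j$ and $\gamma_{ij}$ are
normally distributed

(9.6.13)  $E\left[ \frac{S_a}{a - 1} \right] = br\sigma_\alpha^2 + r\sigma_\gamma^2 + \sigma^2$

Also, by using Eq. (8) in the second equation of (12) and noting that $\bar{\gamma}_{.j} = 0$
for all j, since $\sum_i \gamma_{ij} = 0$, we find

(9.6.14)  $S_b = ar \sum_j [(\beta_j - \bar{\beta}_.) + (\bar{\varepsilon}_{.j.} - \bar{\varepsilon})]^2$

          $= ar \sum_j (\beta_j - \bar{\beta}_.)^2 + ar \sum_j (\bar{\varepsilon}_{.j.} - \bar{\varepsilon})^2$

          $\quad + 2ar \sum_j (\beta_j - \bar{\beta}_.)(\bar{\varepsilon}_{.j.} - \bar{\varepsilon})$

The expectation of $\sum_j (\beta_j - \bar{\beta}_.)^2$ is $(b - 1)\sigma_\beta^2$. Since $\bar{\varepsilon}_{.j.}$ is normal with
variance $\sigma^2/(ar)$, the expectation of $\sum_j (\bar{\varepsilon}_{.j.} - \bar{\varepsilon})^2$ is similarly $(b - 1)\sigma^2/(ar)$. The
expectation of the product term vanishes because of the independence of $\beta_j$
and $\varepsilon_{ijk}$. Therefore,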


(9.6.15)  $E(S_b) = ar(b - 1)\sigma_\beta^2 + (b - 1)\sigma^2$

so that

(9.6.16)  $E\left[ \frac{S_b}{b - 1} \right] = ar\sigma_\beta^2 + \sigma^2$

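In the same way we can write

(9.6.17)  $S_{ab} = r \sum_{ij} (\gamma_{ij} - \bar{\gamma}_{i.} + \bar{\varepsilon}_{ij.} - \bar{\varepsilon}_{i..} - \bar{\varepsilon}_{.j.} + \bar{\varepsilon})^2$

Now $E \sum_j (\gamma_{ij} - \bar{\gamma}_{i.})^2 = (b - 1)\sigma_{\gamma i}^2$, since the $\gamma_{ij}$ for given i are independent
variates with variance $\sigma_{\gamma i}^2$. It follows that $E \sum_{ij} (\gamma_{ij} - \bar{\gamma}_{i.})^2 = (a - 1)(b - 1)\sigma_\gamma^2$.
Let us define a variate $\eta_{ij}$ by

$\eta_{ij} = \bar{\varepsilon}_{ij.} - \bar{\varepsilon}_{i..}$

Then $\eta_{ij} - \bar{\eta}_{.j} = \bar{\varepsilon}_{ij.} - \bar{\varepsilon}_{i..} - \bar{\varepsilon}_{.j.} + \bar{\varepsilon}$.
Since $\bar{\varepsilon}_{ij.}$ has variance $\sigma^2/r$, the variance of $\eta_{ij}$ is $(b - 1)\sigma^2/(br)$ (see § 2.14). The
expectation of $\sum_i (\eta_{ij} - \bar{\eta}_{.j})^2$ is therefore $(a - 1)(b - 1)\sigma^2/(br)$, and that of
$\sum_{ij} (\eta_{ij} - \bar{\eta}_{.j})^2$ is b times as great. From Eq. (17),

$S_{ab} = r \sum_{ij} [(\gamma_{ij} - \bar{\gamma}_{i.})^2 + (\eta_{ij} - \bar{\eta}_{.j})^2 + 2(\gamma_{ij} - \bar{\gamma}_{i.})(\eta_{ij} - \bar{\eta}_{.j})]$

The expectation of the product term is zero, because of the independence
of $\gamma_{ij}$ and $\varepsilon_{ijk}$, and therefore we have

$E(S_{ab}) = r(a - 1)(b - 1)\sigma_\gamma^2 + (a - 1)(b - 1)\sigma^2$

Dividing by (a - 1)(b - 1), we obtain the expectation of the mean square,

(9.6.18)  $E\left[ \frac{S_{ab}}{(a - 1)(b - 1)} \right] = r\sigma_\gamma^2 + \sigma^2$

Finally, $S_r = \sum_{ijk} (\varepsilon_{ijk} - \bar{\varepsilon}_{ij.})^2$ and $E(S_r) = ab(r - 1)\sigma^2$, so that

(9.6.19)  $E\left[ \frac{S_r}{ab(r - 1)} \right] = \sigma^2$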

The  analysis  of  variance  for  the  mixed  Model  III  is  set  out  in  Table  9.5. 

Table 9.5

Source of Variation           D.F.              M.S.                          E(M.S.)
A-effects (fixed)             a - 1             $S_a/(a - 1)$                 $br\sigma_\alpha^2 + r\sigma_\gamma^2 + \sigma^2$
B-effects (random)            b - 1             $S_b/(b - 1)$                 $ar\sigma_\beta^2 + \sigma^2$
A x B-effects (interaction)   (a - 1)(b - 1)    $S_{ab}/[(a - 1)(b - 1)]$     $r\sigma_\gamma^2 + \sigma^2$
Error                         ab(r - 1)         $S_r/[ab(r - 1)]$             $\sigma^2$
Total                         abr - 1

The four sums of squares are pairwise independent, except for the pair $S_b$
and $S_{ab}$. We can therefore test for interaction by comparing the mean squares
for interaction and error, test for B-effects by comparing the mean squares for
B-effects and error, and test for A-effects by comparing the mean squares for
A-effects and interaction.
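As an illustration only, the mixed-model tests can be applied to the mean squares of Table 9.4, taking the machines of Example 2 as the fixed A-effects and the days as the random B-effects (an assumption that § 9.4 suggests may be reasonable, although the text does not carry out this calculation). The sketch also equates the mean squares to the last column of Table 9.5 to obtain moment estimates of the variance components; the variable names are hypothetical.

```python
# Mean squares from Table 9.4: A = machines (fixed), B = days (random).
a, b, r = 3, 9, 3
ms = {"A": 0.4584, "B": 0.0692, "AB": 0.0541, "error": 0.0373}

# Test ratios appropriate to the mixed model (Table 9.5).
f_interaction = ms["AB"] / ms["error"]   # (a-1)(b-1) and ab(r-1) d.f.
f_b = ms["B"] / ms["error"]              # b-1 and ab(r-1) d.f.
f_a = ms["A"] / ms["AB"]                 # a-1 and (a-1)(b-1) d.f.

# Moment estimates from E(M.S.); such estimates can turn out negative.
var_error = ms["error"]
var_gamma = (ms["AB"] - ms["error"]) / r
var_beta = (ms["B"] - ms["error"]) / (a * r)

print(f"F(A) = {f_a:.2f}, F(B) = {f_b:.2f}, F(AB) = {f_interaction:.2f}")
print(f"variance estimates: error {var_error:.4f}, gamma {var_gamma:.4f}, beta {var_beta:.4f}")
```

Under this model the ratio for machines is formed against the interaction mean square, with 2 and 16 d.f., rather than against the error mean square as in Model I.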

Estimates of $\sigma_\alpha^2$, $\sigma_\beta^2$, $\sigma_\gamma^2$ and $\sigma^2$ can be calculated from the last column of
Table 9.5. The estimator of $\mu$ is the over-all mean $\bar{x}$, and estimators of $\alpha_i$, $\beta_j$
and $\gamma_{ij}$ are, respectively,

(9.6.20)  $\hat{\alpha}_i = \bar{x}_{i..} - \bar{x}$

(9.6.21)  $\hat{\beta}_j = \bar{x}_{.j.} - \bar{x}$

(9.6.22)  $\hat{\gamma}_{ij} = \bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x}$

*  9.7  Nested (or Hierarchal) Models  One type of incomplete design is that
in which a factor B is "nested" within another factor A. This means that for
each level of A there is a set of levels of B, and each B-level occurs in just one
A-level. In the type of design considered earlier in this chapter, each B-level
occurs in each A-level, and A and B are said to be completely crossed. If this is
not so but if some B-level occurs in at least two A-levels, the factors are said
to be partly crossed.

As  an  example,  we  may  consider  an  experiment  on  the  determination  of 
the  protein  content  of  wheat.  From  many  suitable  laboratories  in  Canada, 
three  were  selected  at  random  and,  in  each  of  these,  five  samples  of  wheat  were 
analyzed  on  each  of  two  days.  All  the  30  wheat  samples  were  parts  of  one 
carefully  selected  master  sample  and  were  thoroughly  mixed  and  randomized. 
There  is  presumably  a  main  effect  between  the  different  laboratories,  and  also 
between  days  in  each  laboratory,  but  day  number  1  in  one  laboratory  has 
nothing whatever necessarily in common with day number 1 in another laboratory.
In fact, we might not even use the same number of days in the different
laboratories.  The  day-effect  is  said  to  be  "nested"  within  the  laboratory  effect, 
or,  to  put  it  another  way,  the  day-effect  has  a  lower  rank  in  the  hierarchy  of 
effects  than  the  laboratory  effect.  This  is  why  the  nested  model  is  sometimes 
called "hierarchal." There is no need for any interaction term in this model,
since no B-level occurs with more than one A-level.

We will suppose that the A-effect occurs at a levels, denoted by i (i = 1,
2 ... a), and that, nested within the ith level, there are $b_i$ B-levels, denoted by j
(j = 1, 2 ... $b_i$). In the illustration above, the $b_i$ were all equal to 2, but this
is not necessary. We suppose also that corresponding to any given i and j
there are $r_{ij}$ replicates of the measurement of a variate X. The $r_{ij}$ can be
different for each pair of i and j, but it will be convenient to assume that they
are all equal to r. The measurement on the kth replicate will then be denoted
by $x_{ijk}$ and is given by

(9.7.1)  $x_{ijk} = m_{ij} + \varepsilon_{ijk}$

where the $\varepsilon_{ijk}$ can be regarded as "errors," which are independently and normally
distributed with expectation 0 and variance $\sigma^2$, and are independent of
the $m_{ij}$. The $m_{ij}$ are the expectations (or true values) for the ith A-level and
the jth B-level.

We  may  use  either  Model  I,  II  or  III  for  the  effects.  If  we  suppose,  in  the 
illustration  mentioned  earlier,  that  the  laboratories  are  chosen  at  random  from 
a  large  number,  and  if  the  days  are  also  (as  far  as  possible)  random,  we  should 
use  Model  II,  and  this  model  will  be  assumed  in  the  rest  of  this  section.  There 
is then a large population of A-levels, denoted in general by u, and for the uth
level there is a large population of B-levels nested within it and denoted in
general by v. The expectation of X, measured for the vth sub-level of the uth
level, will be the random variable $m_{uv}$. The $m_{ij}$ of Eq. (1) is the value of $m_{uv}$
for u = i and v = j.

If the conditional expectation of $m_{uv}$ for fixed u is denoted by $m_{u.}$, and the
expectation of $m_{u.}$ is denoted by $m_{..}$, we may write

(9.7.2)  $m_{uv} = \mu + \alpha_u + \beta_{uv}$

where

(9.7.3)  $\mu = m_{..}$

         $\alpha_u = m_{u.} - m_{..}$

         $\beta_{uv} = m_{uv} - m_{u.}$

As usual, $\mu$ is the over-all expectation of X. The random variable $\alpha_u$ has
expectation 0 and variance $\sigma_\alpha^2$. The random variable $\beta_{uv}$ has a conditional
expectation (for given u) zero and conditional variance $\sigma_{\beta u}^2$. These variables
are uncorrelated, as may be shown with the help of a theorem on conditional
expectations in Appendix A.14. The proof goes as follows:

$E(\alpha_u \beta_{uv} | u) = \alpha_u E(\beta_{uv} | u) = 0$

and therefore, by Eq. (A.14.4), $E(\alpha_u \beta_{uv}) = 0$, which shows that the correlation
is zero, both variables having zero expectations.

For the variance of $\beta_{uv}$, by Eq. (A.14.6), we have

$V(\beta_{uv}) = V[E(\beta_{uv} | u)] + E[V(\beta_{uv} | u)]$

The first term on the right-hand side is zero; hence,

(9.7.4)  $V(\beta_{uv}) = E(\sigma_{\beta u}^2) = \sigma_\beta^2$

The A-levels actually selected in the experiment may be denoted by $u_i$
(i = 1, 2 ... a), and the B-levels corresponding to $u_i$ by $v_j$ (j = 1, 2 ... $b_i$).
Then

(9.7.5)  $m_{ij} = \mu + \alpha_i + \beta_{ij}$

where $\alpha_i$ is the value of $\alpha_u$ for u = $u_i$ and $\beta_{ij}$ the value of $\beta_{uv}$ for u = $u_i$ and
v = $v_j$. These are uncorrelated and are both independent of $\varepsilon_{ijk}$. We now
suppose that they are also normally distributed.

The total sum of squares $S_t$ is divided into three parts:

(9.7.6)  $S_t = S_a + S_b + S_r$

where

(9.7.7)  $S_a = r \sum_i b_i (\bar{x}_{i..} - \bar{x})^2$

         $S_b = r \sum_i \sum_j (\bar{x}_{ij.} - \bar{x}_{i..})^2$

         $S_r = \sum_{ijk} (x_{ijk} - \bar{x}_{ij.})^2$

It should be noted that the mean over i is a weighted mean when the $b_i$ are not
all equal. Thus,

$\bar{x} = \frac{\sum_i b_i \bar{x}_{i..}}{\sum_i b_i} = \frac{1}{N} \sum_{ijk} x_{ijk}$

where $N = r \sum_i b_i$.

These three sums of squares are statistically independent. Details of the
proof of this statement may be found in Scheffé's book [7].

The quantity $S_r/\sigma^2$, under the normality assumption, is a $\chi^2$-variate with
$(r - 1) \sum_i b_i$ degrees of freedom. Also $S_b/(\sigma^2 + r\sigma_\beta^2)$ is a $\chi^2$-variate with
$\sum_i (b_i - 1)$ d.f. However, $S_a$ is not in general equal to a constant times a chi-square
variate. If it happens that $\sigma_\alpha^2 = 0$ (which means that there are no
A-effects), it may be shown that $S_a/(\sigma^2 + r\sigma_\beta^2)$ is distributed like $\chi^2$ with a - 1
d.f. Also, if all the $b_i$ are equal to b (as in the example of the laboratory analyses
described above), $S_a/(\sigma^2 + r\sigma_\beta^2 + br\sigma_\alpha^2)$ is distributed like $\chi^2$ with a - 1 d.f.
In the more general case, the expectation of $S_a$ is given by

(9.7.8)  $E(S_a) = (a - 1)(\sigma^2 + r\sigma_\beta^2) + (a - 1)A\sigma_\alpha^2$

where

(9.7.9)  $(a - 1)A = r \left( \sum_i b_i - \frac{\sum_i b_i^2}{\sum_i b_i} \right)$

which, when $b_i = b$, reduces to br(a - 1).
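The coefficient A of Eq. (9.7.9) is easily evaluated for any set of $b_i$. The Python sketch below is illustrative only: the function name `nested_coefficient` and the unbalanced layout in the second call are hypothetical, while the balanced call uses the laboratory example (a = 3, $b_i$ = 2, r = 5).

```python
def nested_coefficient(b_levels, r):
    """Coefficient A of sigma_alpha^2 in E(M.S.) for A-effects, from Eq. (9.7.9).

    b_levels[i] is the number of B-levels nested in the ith A-level;
    r is the common number of replicates.
    """
    a = len(b_levels)
    total = sum(b_levels)
    return r * (total - sum(b * b for b in b_levels) / total) / (a - 1)


# Balanced case: three laboratories, two days each, five samples per day.
balanced = nested_coefficient([2, 2, 2], 5)

# Hypothetical unbalanced case with the same total number of B-levels.
unbalanced = nested_coefficient([1, 2, 3], 5)

print(balanced, unbalanced)
```

In the balanced case the coefficient equals br = 10; with the same total number of B-levels spread unequally it is smaller.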

The  analysis  of  variance  table  is,  therefore,  as  given  in  Table  9.6. 

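Table 9.6

Variation                  S.S.     D.F.                  M.S.                         E(M.S.)
A-effects                  $S_a$    a - 1                 $S_a/(a - 1)$                $\sigma^2 + r\sigma_\beta^2 + A\sigma_\alpha^2$
B-effects (nested in A)    $S_b$    $\sum_i b_i - a$      $S_b/(\sum_i b_i - a)$       $\sigma^2 + r\sigma_\beta^2$
Error                      $S_r$    $(r - 1)\sum_i b_i$   $S_r/[(r - 1)\sum_i b_i]$    $\sigma^2$
Total                      $S_t$    $r\sum_i b_i - 1$

If $b_i = b$ for all i, A = br, $\sum_i b_i = ab$.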

The hypothesis of no B-effects ($\sigma_\beta^2 = 0$) may be tested by an exact F-test
of the mean square for B-effects divided by the mean square for error. A test
of the hypothesis $\sigma_\alpha^2 = 0$ may be made by dividing the mean square for A-effects
by the mean square for B-effects. The power of these tests can be calculated
as in § 8.16, except for the test of $\sigma_\alpha^2 = 0$ when the $b_i$ are unequal. In this
case an approximate method must be used.

9.8  Latin Square Designs  In a complete experimental design there is at
least one observation for every possible combination of levels. The nested
design is incomplete because each B-level appears in combination with one
and only one A-level (that level within which it is nested). The Latin square is
another type of incomplete design in which there are three factors, all with the
same number of levels m, but in which observations are made on only $m^2$
instead of the possible $m^3$ combinations. This is obviously economical of effort,
but the design is rather restrictive (enforcing the same number of levels for each
factor) and the usual analysis does not allow for interactions.

A  Latin  square  is  a  square  array  of  symbols,  m  to  a  row  or  column,  and 
such  that  each  symbol  appears  once  and  only  once  in  each  row  and  each  column. 
Thus in the figure below there are 5 letters arranged in this way. In an
agricultural experiment these might represent 5 varieties of wheat planted in 25 plots
arranged  in  5  rows  and  5  columns  across  a  field.  The  purpose  of  the  arrangement 
is  to  allow  for  a  possible  row-effect  or  column-effect  on  the  yield,  such  as  might 
be  produced  if  there  is  a  definite  fertility  gradient  across  the  field  in  the  direction 
of  the  columns  or  the  rows.  In  an  animal-feeding  experiment  there  might  be  5 
types  of  ration  used  on  25  animals.  If  the  animals  belonged  to  5  different  litters 
(5  to  each  litter)  and  could  be  kept  in  5  types  of  pens,  5  in  each  type,  we  could 
arrange  a  Latin  square  experiment  in  which  each  type  of  pen  contained  just  1 
animal  from  each  litter.  The  5  rations  would  be  allocated  to  the  animals 
according  to  a  Latin  square,  and  we  could  then  eliminate  the  effect  of  litters 
and  of  pens  on  the  gains  in  weight. 

A  Latin  square  remains  a  Latin  square  when  the  rows,  or  the  columns,  are 
permuted  among  themselves.  A  standard  m  by  m  square  is  one  in  which  the 
first  row  and  the  first  column  contain  the  m  symbols  in  their  natural  order,  as 
in  the  right-hand  5  by  5  square  below. 


E  A  C  B  D          A  B  C  D  E
B  C  E  D  A          B  E  A  C  D
A  B  D  E  C          C  D  B  E  A
C  D  B  A  E          D  C  E  A  B
D  E  A  C  B          E  A  D  B  C

With  5  symbols  there  are  56  different  standard  squares,  and  from  each  of  these 
we  may  obtain  2880  Latin  squares  by  permuting  rows  and  columns.  The 
number  of  possible  Latin  squares  increases  enormously  as  m  increases.  In  an 
experimental  design  we  should  choose  a  Latin  square  at  random  from  the 
whole  set  of  possibilities.  We  could,  for  instance,  first  choose  a  standard  square 
from  a  complete  set  of  standard  squares,  such  as  is  found  in  Fisher  and  Yates' 
tables  [8]  for  the  smaller  values  of  m,  and  then  permute  at  random  the  columns 
and the rows (except the first row). For this purpose a table of random
permutations of the numbers 1 to 9 may be used; see, for example, [9].
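
The randomization just described can be sketched in a few lines. The example below is illustrative only: it starts from the right-hand standard square shown above, and the names `is_latin` and `randomize` are assumptions made for this sketch. A pseudo-random generator stands in for the printed table of random permutations.

```python
import math
import random

# The right-hand (standard) 5 by 5 square shown above.
STANDARD_5 = ["ABCDE", "BEACD", "CDBEA", "DCEAB", "EADBC"]


def is_latin(square):
    """True if every symbol appears exactly once in each row and each column."""
    m = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == m and set(row) == symbols for row in square)
    cols_ok = all({row[k] for row in square} == symbols for k in range(m))
    return len(symbols) == m and rows_ok and cols_ok


def randomize(standard, rng):
    """Permute all columns, and all rows except the first, at random."""
    m = len(standard)
    cols = list(range(m))
    rng.shuffle(cols)
    rows = list(range(1, m))
    rng.shuffle(rows)
    rows = [0] + rows                      # the first row keeps its place
    return ["".join(standard[j][k] for k in cols) for j in rows]


# Number of squares obtainable from one standard square: m! * (m - 1)!
squares_per_standard = math.factorial(5) * math.factorial(4)

design = randomize(STANDARD_5, random.Random(0))
```

Any square produced in this way is again a Latin square, which `is_latin(design)` confirms.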
The  mathematical  model  for  a  Latin  square  design  is 

(9.8.1)  $x_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + \varepsilon_{ijk}$

where $i$, $j$, $k$ take values from 1 to $m$, but where only $m^2$ sets of triples $(i, j, k)$
are permissible, these being dictated by the particular Latin square used. The
$\alpha_i$ are the main effects for the first factor (say treatments), the $\beta_j$ are those for
the second factor (say rows) and the $\gamma_k$ those for the third factor (say columns).
The $\varepsilon_{ijk}$ are assumed to be independently normal with expectation 0 and variance
$\sigma^2$. The mean values for $\alpha_i$, $\beta_j$ and $\gamma_k$ are adjusted to zero as usual. This model
ignores any interaction between the three main effects. The null hypotheses to
be tested are (a) all $\alpha_i = 0$, (b) all $\beta_j = 0$, (c) all $\gamma_k = 0$.

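We will assume that the effects are fixed, as in Model I. The total sum of
squares $S_t$ may be divided into four parts:

(9.8.2)  $S_t = S_a + S_b + S_c + S_e$

where

(9.8.3)  $S_t = \sum_{(i,j,k) \in D} (x_{ijk} - \bar{x})^2 = \sum_{(i,j,k) \in D} x_{ijk}^2 - G$

$D$ being the actual set of triples $(i, j, k)$ appearing in the square.

(9.8.4)  $S_a = m\sum_i (\bar{x}_{i..} - \bar{x})^2 = m\sum_i \bar{x}_{i..}^2 - G$
         $S_b = m\sum_j (\bar{x}_{.j.} - \bar{x})^2 = m\sum_j \bar{x}_{.j.}^2 - G$
         $S_c = m\sum_k (\bar{x}_{..k} - \bar{x})^2 = m\sum_k \bar{x}_{..k}^2 - G$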


Here $\bar{x}$ is the arithmetic mean of the $m^2$ observations and is an unbiased
estimator for $\mu$. $G$ is the quantity $m^2\bar{x}^2$. Also $\bar{x}_{i..}$ is the average of the $m$
observations on the $i$th treatment, $\bar{x}_{.j.}$ is the average of the $m$ observations in the $j$th
row, and $\bar{x}_{..k}$ is the average of the $m$ observations in the $k$th column. The sum
of squares for error $S_e$ is calculated by difference from Eqs. (2), (3) and (4).

The number of degrees of freedom for treatments, rows and columns is $m - 1$
in each case. Since the total number is $m^2 - 1$, there are $m^2 - 3m + 2$ degrees
of freedom for error. The expectations of the mean squares are given in
Table 9.7, where $\sigma_\alpha^2$ is a symbol for $\sum_i \alpha_i^2/(m - 1)$ and similarly for $\sigma_\beta^2$
and $\sigma_\gamma^2$.

Table 9.7
Analysis of Variance for m x m Latin Square

Variation        S.S.    D.F.             M.S.                   E(M.S.)
Treatments (A)   $S_a$   $m - 1$          $S_a/(m - 1)$          $\sigma^2 + m\sigma_\alpha^2$
Rows (B)         $S_b$   $m - 1$          $S_b/(m - 1)$          $\sigma^2 + m\sigma_\beta^2$
Columns (C)      $S_c$   $m - 1$          $S_c/(m - 1)$          $\sigma^2 + m\sigma_\gamma^2$
Error            $S_e$   $m^2 - 3m + 2$   $S_e/(m^2 - 3m + 2)$   $\sigma^2$
Total            $S_t$   $m^2 - 1$

The first four sums of squares in the above table are statistically independent,
so that F-tests of the three null hypotheses mentioned above may be applied
to test for significant effects of treatments, rows or columns.
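
The computation of Eqs. (9.8.2) to (9.8.4) can be sketched as follows. This is a minimal illustration rather than a prescribed procedure: the function name `latin_square_anova` is an assumption, and the data are synthetic, built from model (9.8.1) with hypothetical effects and no error term, so that the error sum of squares must come out zero.

```python
def latin_square_anova(square, x):
    """Sums of squares (9.8.2)-(9.8.4) for an m x m Latin square.

    square[j][k] : treatment index (0 .. m-1) used in row j, column k
    x[j][k]      : observation in row j, column k
    """
    m = len(square)
    G = sum(sum(row) for row in x) ** 2 / (m * m)      # G = m^2 * xbar^2
    S_t = sum(v * v for row in x for v in row) - G

    row_tot = [sum(x[j]) for j in range(m)]
    col_tot = [sum(x[j][k] for j in range(m)) for k in range(m)]
    trt_tot = [0.0] * m
    for j in range(m):
        for k in range(m):
            trt_tot[square[j][k]] += x[j][k]

    # m * (sum of squared means) equals (sum of squared totals) / m
    S_a = sum(t * t for t in trt_tot) / m - G
    S_b = sum(t * t for t in row_tot) / m - G
    S_c = sum(t * t for t in col_tot) / m - G
    S_e = S_t - S_a - S_b - S_c                         # by difference
    return {"S_a": S_a, "S_b": S_b, "S_c": S_c, "S_e": S_e, "S_t": S_t,
            "df_error": (m - 1) * (m - 2)}              # m^2 - 3m + 2


# Synthetic, error-free data on the standard 5 by 5 square (hypothetical effects).
letters = ["ABCDE", "BEACD", "CDBEA", "DCEAB", "EADBC"]
square = [[ord(c) - ord("A") for c in row] for row in letters]
mu = 10
alpha = [-2, -1, 0, 1, 2]      # treatment effects, sum to zero
beta = [3, 0, -1, -2, 0]       # row effects, sum to zero
gamma = [1, 1, 0, -1, -1]      # column effects, sum to zero
x = [[mu + alpha[square[j][k]] + beta[j] + gamma[k] for k in range(5)]
     for j in range(5)]

table = latin_square_anova(square, x)
```

For these artificial data each sum of squares equals $m$ times the sum of the squared effects of the corresponding factor, and nothing is left for error.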

If  there  are  interactions  between  these  effects  the  interpretation  of  the  results 
of  the  experiment  is  complicated  and  difficult.  Usually  the  expectation  of  the 
error  mean  square  will  be  increased,  but  the  effect  of  interaction  on  the  other 
mean  squares  may  be  to  increase  or  reduce  them. 

*  9.9  Balanced  Incomplete  Block  Designs  In  the  complete  randomized  block 
design  of  §  9.3,  every  treatment  appears  in  every  block,  possibly  with  several 
replications  as  well.  But  it  is  sometimes  advisable  to  have  smaller  blocks,  the 
size  of  which  is  dictated  by  special  circumstances.  If  one  were  carrying  out 
tests  on  the  life  of  automobile  tires  of  various  makes,  the  four  wheels  of  a  car 
would  form  a  natural  "block." 

We  shall  consider  a  fixed-effects  model,  even  though  it  might  well  be  more 
reasonable  in  some  cases  to  think  of  the  blocks  as  randomly  chosen  from  a 
large  population  of  blocks  (this  would  be  true  for  the  car  tires,  for  example, 
in  the  illustration  above).  We  suppose  that  the  blocks  are  all  of  the  same  size, 
that  each  treatment  occurs  r  times,  and  that  no  treatment  appears  twice  in  the 
same  block.  Then  if  there  are  k  "plots"  (or  items)  in  a  block,  b  blocks,  and 
a  treatments,  we  obviously  have 

(9.9.1)  N  =  ar  =  bk 

where  k  <  a. 

A design is said to be balanced if each pair of treatments occurs in the same
number of blocks. If this number is denoted by $\lambda$, it follows that in a balanced
design

(9.9.2)  $(a - 1)\lambda = (k - 1)r$

since each treatment shares its $r$ blocks with $k - 1$ other plots in each, and
must meet each of the other $a - 1$ treatments $\lambda$ times.
In the example illustrated in Table 9.8 there are seven treatments (denoted by
letters A to G), and seven blocks, each of size 4, so that $\lambda = 2$. The pair BD,
for example, occurs in blocks 5 and 7 only, and similarly for every other pair.


Table 9.8

Block Number    1  2  3  4  5  6  7

                C  A  G  F  B  E  D
                G  F  E  A  D  C  B
                E  G  A  B  C  D  F
                F  D  B  C  G  A  E
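
The balance of the design in Table 9.8 can be verified by counting, as in the sketch below. The blocks are read down the columns of the table; the name `design_parameters` is an assumption introduced for this illustration.

```python
from collections import Counter
from itertools import combinations

# Blocks 1 to 7 of Table 9.8, each read down its column.
BLOCKS = ["CGEF", "AFGD", "GEAB", "FABC", "BDCG", "ECDA", "DBFE"]


def design_parameters(blocks):
    """Return a, b, k and the sets of replication and pair-concurrence counts."""
    treatments = sorted(set("".join(blocks)))
    a, b, k = len(treatments), len(blocks), len(blocks[0])
    reps = Counter(t for blk in blocks for t in blk)
    together = Counter(
        frozenset(pair) for blk in blocks for pair in combinations(blk, 2)
    )
    # Look up every possible pair, so pairs that never meet count as 0.
    lambdas = {together[frozenset(p)] for p in combinations(treatments, 2)}
    return a, b, k, set(reps.values()), lambdas


a, b, k, r_values, lambda_values = design_parameters(BLOCKS)
balanced = len(r_values) == 1 and len(lambda_values) == 1

# Blocks (numbered from 1) that contain both B and D.
blocks_with_BD = [j + 1 for j, blk in enumerate(BLOCKS) if "B" in blk and "D" in blk]
```

A single value in `r_values` and a single value in `lambda_values` means that every treatment is replicated equally often and every pair of treatments meets equally often, which is the definition of balance used here.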

The analysis is considerably simplified when the design is balanced. A further
necessary condition for such a design is

(9.9.3)  $a \le b$

which implies $r \ge k$. Many designs satisfying conditions (1), (2) and (3) may
be found in reference [9], Chapter 11. Once a suitable design has been selected,
the numbering of the treatments and of the blocks, and the positions within
the blocks, may all be randomized.

If  we  want  to  balance  out  the  positions  by  ensuring  that  each  treatment 
occurs  just  m  times  in  each  of  the  k  positions  in  a  block,  we  must  add  a  fourth 
condition 

(9.9.4)  b  =  ma 

or $r = km$. This is satisfied in Table 9.8, with $m = 1$. Treatment A, for instance,
occurs just once in each of the four positions in a block.

The mathematical model is

(9.9.5)  $x_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$

where the $\alpha_i$ are treatment effects, and the $\beta_j$ are block effects and, as usual,
$\sum_i \alpha_i = \sum_j \beta_j = 0$. It is assumed that there is no interaction. The $\varepsilon_{ij}$ are
independently normal with expectation zero and variance $\sigma^2$, but it must be observed
that only some of the possible pairs $i$, $j$ correspond to actual observations. If
$K_{ij}$ is the number of times the $i$th treatment occurs in the $j$th block, $K_{ij}$ is either
0 or 1, and the observed values of $x_{ij}$ correspond to those pairs for which
$K_{ij} = 1$ (there are $ar$ of these, out of $ab$ pairs altogether).

If we define the $i$th treatment total $g_i$ and the $j$th block total $h_j$ by

(9.9.6)  $g_i = \sum_j x_{ij}, \qquad h_j = \sum_i x_{ij}$

the $i$th adjusted treatment total is

(9.9.7)  $G_i = g_i - k^{-1}\sum_j K_{ij} h_j = g_i - T_i/k$

where $T_i$ is the sum of the block totals in which the $i$th treatment appears. The
adjustment therefore consists in subtracting the sum of the block averages for
those blocks in which the $i$th treatment occurs.

The total sum of squares $S_t$ is given by

(9.9.8)  $S_t = \sum_D x_{ij}^2 - G$

where $D$ refers to the set of pairs $(i, j)$ for which $K_{ij} = 1$ and $G$ is the correction
for the grand mean,

(9.9.9)  $G = \frac{1}{N}\left(\sum_D x_{ij}\right)^2$

This total sum of squares is split into three parts:

(9.9.10)  $S_t = S_{t.e.b.} + S_{b.i.t.} + S_e$

where $S_{t.e.b.}$ is the sum of squares for treatments, eliminating blocks, and $S_{b.i.t.}$
is the sum of squares for blocks, ignoring treatments. These are given respectively
by

(9.9.11)  $S_{t.e.b.} = \sum_i \frac{G_i^2}{rE}$

(9.9.12)  $S_{b.i.t.} = \sum_j \frac{h_j^2}{k} - G$

The number $E$, called the efficiency factor of the design, is defined by

(9.9.13)  $E = \frac{a(k - 1)}{k(a - 1)}$

and is less than 1 because $k < a$. The sum of squares for error, $S_e$, is obtained
by subtraction. The number of degrees of freedom for error is
$N - 1 - (a - 1) - (b - 1) = N - a - b + 1$. The analysis of variance is outlined
in Table 9.9.


Table 9.9

Variation                         S.S.                     D.F.              E(M.S.)
Treatments (eliminating blocks)   $\sum_i G_i^2/(rE)$      $a - 1$           $\sigma^2 + r\sigma_\alpha^2 E$
Blocks (ignoring treatments)      $\sum_j h_j^2/k - G$     $b - 1$
Error                             (by subtraction)         $N - a - b + 1$   $\sigma^2$
Total                             $\sum_D x_{ij}^2 - G$    $N - 1$

In the last column of Table 9.9, $\sigma_\alpha^2$ is defined as usual (for fixed effects) by

(9.9.14)  $\sigma_\alpha^2 = \frac{\sum_i \alpha_i^2}{a - 1}$

Treatment effects may be tested for significance by the ordinary F-test, and
estimated by

(9.9.15)  $\hat{\alpha}_i = \frac{G_i}{rE}$

If it is desired to test the hypothesis that all the block effects are zero, the
numerator of the F-statistic is the mean square for "blocks eliminating
treatments," given by $S_{b.e.t.}/(b - 1)$, where

(9.9.16)  $S_{b.e.t.} = \sum_j \frac{h_j^2}{k} + \sum_i \frac{G_i^2}{rE} - \sum_i \frac{g_i^2}{r}$

However, the block effects are usually regarded as less important than the
treatment effects.

For details of the many other types of experimental design used in practice
see reference [9], and for a full treatment of the theoretical considerations
involved, reference [7] is invaluable.
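
The balanced incomplete block computations of this section, from the treatment and block totals through the adjusted totals to the estimates $G_i/(rE)$, can be sketched as below for the design of Table 9.8. The function name `bibd_analysis` and the data are assumptions made for illustration: the observations are synthetic, generated from the additive model with hypothetical effects and no error, so the estimates should reproduce the effects exactly and the error sum of squares should vanish.

```python
# Blocks 1 to 7 of Table 9.8, each read down its column.
BLOCKS = ["CGEF", "AFGD", "GEAB", "FABC", "BDCG", "ECDA", "DBFE"]


def bibd_analysis(blocks, x):
    """x[(treatment, block index)] holds the observation for that plot."""
    treatments = sorted({t for blk in blocks for t in blk})
    a, b, k = len(treatments), len(blocks), len(blocks[0])
    N = b * k
    r = N // a
    E = a * (k - 1) / (k * (a - 1))              # efficiency factor

    g = {t: 0.0 for t in treatments}             # treatment totals
    h = [0.0] * b                                # block totals
    for j, blk in enumerate(blocks):
        for t in blk:
            g[t] += x[t, j]
            h[j] += x[t, j]

    # Adjusted totals: subtract the averages of the blocks containing t.
    G_adj = {t: g[t] - sum(h[j] for j, blk in enumerate(blocks) if t in blk) / k
             for t in treatments}

    corr = sum(h) ** 2 / N                       # correction for the grand mean
    S_t = sum(v * v for v in x.values()) - corr
    S_teb = sum(v * v for v in G_adj.values()) / (r * E)
    S_bit = sum(v * v for v in h) / k - corr
    S_e = S_t - S_teb - S_bit                    # by subtraction
    alpha_hat = {t: G_adj[t] / (r * E) for t in treatments}
    return {"E": E, "S_t": S_t, "S_teb": S_teb, "S_bit": S_bit,
            "S_e": S_e, "alpha_hat": alpha_hat}


# Synthetic error-free data (hypothetical effects, each set summing to zero).
mu = 20.0
alpha = dict(zip("ABCDEFG", [-3, -2, -1, 0, 1, 2, 3]))
beta = [2, -1, 0, 1, -2, 3, -3]
x = {(t, j): mu + alpha[t] + beta[j] for j, blk in enumerate(BLOCKS) for t in blk}

result = bibd_analysis(BLOCKS, x)
```

For this design $E = 7 \cdot 3/(4 \cdot 6) = 0.875$, so about one eighth of the information on treatment differences is given up in exchange for the smaller blocks.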

9.10  Departures  from  the  Assumptions  Underlying  Analysis  of  Variance 
Techniques  The  assumptions  usually  made  fall  under  four  heads :  (a)  additivity 
of  effects,  which  implies  an  absence  of  interaction,  (b)  independence  of  the 
error  terms,  (c)  normality  of  the  distribution  of  error  terms,  (d)  constancy  of 
the  variance  of  error  terms,  whatever  the  magnitude  of  the  main  effects.  Taken 
together,  these  constitute  a  severe  restriction  on  the  type  of  data  to  which  the 
techniques  of  analysis  of  variance  are  strictly  relevant.  It  is  highly  desirable  to 
know  whether  these  assumptions  can  be  relaxed  appreciably  in  practice. 

As  regards  non-normality,  some  empirical  data  on  sampling  from  artificial 
non-normal  populations  suggest  that  the  ordinary  F-test  is  fairly  robust  and 
can be used without serious error even for considerable variations from
normality. Care should be used in claiming significance when the probability is
near  the  border-line,  since,  on  the  whole,  non-normality  tends  to  make  results 
look  more  significant  than  they  are. 

Non-normality does not introduce any bias into point estimates of
parameters (or linear combinations of parameters) or of the components of variance.
It  does,  however,  affect  the  validity  of  the  F-tests,  since  without  the  assumption 
of  normality  it  is  not  in  general  true  that  the  mean  squares  have  independent 
chi-square  distributions. 

Inferences about the mean $\mu$ which are valid for a normal population will
also hold for almost any non-normal population as long as the sample is very
large. This, however, is not true for the population variance $\sigma^2$. If the
population has a kurtosis $\gamma_2$, the variance of $s^2/\sigma^2$, where $s^2$ is the sample variance,
is increased for large $N$ by a factor $1 + \tfrac12\gamma_2$, and this may seriously affect any
significance levels or calculations of power obtained from normal theory. Most
inferences about variances, including F-tests, are subject to uncertainty due to
non-normality, but these effects are less when the design is such that equal
numbers of observations occur in each cell of the layout.

As regards independence, the simplest reasonable alternative is probably to
assume that the observations are serially correlated. That is, if the observations,
taken successively, are denoted by $x_1, x_2, \ldots$, there is a constant correlation $\rho$
between $x_i$ and $x_{i+1}$, but all other correlation coefficients are zero. It may be
shown that under these conditions

$E(\bar{x}) = \mu$
$V(\bar{x}) = \frac{\sigma^2}{N}\left[1 + 2\rho\left(1 - \frac{1}{N}\right)\right]$
$E(s^2) = \sigma^2(1 - 2\rho/N)$

with $\rho$ taking possible values from $-\tfrac12$ to $\tfrac12$. The probability that a confidence
interval with nominal confidence coefficient $1 - \alpha$ does not cover the true
value $\mu$ is, for large $N$, not $\alpha$ but $2[1 - \Phi(A)]$ where $A = z_{\alpha/2}(1 + 2\rho)^{-1/2}$, $z_{\alpha/2}$
being the standard normal variate exceeded with probability $\alpha/2$. For $\rho = -\tfrac12$
and $\alpha = 0.05$, the probability $2[1 - \Phi(A)]$ is zero and for $\rho = \tfrac12$ it is 0.166, so
that there is a chance of rather serious error in ignoring $\rho$.
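
The figures quoted for $\rho = -\tfrac12$ and $\rho = \tfrac12$ can be reproduced from the large-sample formula $2[1 - \Phi(A)]$. The sketch below uses only the Python standard library; the function name `true_noncoverage` is an assumption made for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def true_noncoverage(alpha, rho):
    """Large-N probability that a nominal 1 - alpha interval misses the mean,
    when successive observations have serial correlation rho."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    scale = 1 + 2 * rho
    if scale <= 0:                          # rho = -1/2: A is infinite
        return 0.0
    A = z / sqrt(scale)
    return 2 * (1 - std_normal.cdf(A))


at_minus_half = true_noncoverage(0.05, -0.5)
at_zero = true_noncoverage(0.05, 0.0)
at_plus_half = true_noncoverage(0.05, 0.5)
```

With $\rho = 0$ the function returns the nominal value $\alpha$, which is a convenient check on the formula.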

The  effect  of  inequality  of  variances  in  the  one-way  layout  with  fixed  effects 
(§  9.2)  depends  upon  the  sample  sizes.  If  these  are  all  equal,  the  effect  is  slight, 
but  in  general  the  variance  of  the  ratio  of  mean  squares  (between  and  within 
samples) is increased, even for large values of $N_i$, by this inequality. The result
is  to  increase  the  true  probabilities  for  Type  I  errors,  in  the  usual  F-test  for 
equality  of  means,  beyond  the  nominal  value  a.  The  effect  may  be  quite  serious, 
leading  even  to  a  doubling  or  trebling  of  the  error. 

The  most  common  procedure  to  reduce  inequality  of  variance  is  to  make 
use  of  transformations  such  as  those  mentioned  in  §§  3.15,  3.16  and  4.8.  The 
logarithmic  transformation  (using  the  logs  of  the  observations  instead  of  the 
observations themselves) is appropriate when their percentage error is
approximately constant. Transformations to reduce inequality of variance often reduce
non-normality  also,  but  they  may  destroy  the  additivity  of  effects  which  existed 
before  the  transformation. 

9.11  Estimation of Contrasts  The ordinary F-test in an analysis of variance
determines whether there is a significant difference between a group of means,
but the investigator really wants to know more than this: which of the means
differ significantly from which others. One should not, of course, pick out the
two means which differ most (out of $k$ samples, say) and apply the ordinary t-test
to these, since the two selected in this way are clearly not chosen at random.
There are $k(k - 1)/2$ pairs that might be chosen, of which the selected pair is one.

Any linear combination of treatment means (or other parameters) with
coefficients adding up to zero is called a contrast. A difference of two means
such as $\alpha_1 - \alpha_2$ is a contrast and so is an interaction: $\gamma_{ij} = m_{ij} - m_{i.} - m_{.j} + m_{..}$.

Fisher has pointed out [10] that when one wishes to test a particular contrast
(picked out after the results of the experiment are known), the F-test having
failed to demonstrate a significant differentiation, one should be very cautious
in claiming significance. He suggests that if the contrast is one out of $k(k - 1)/2$
possibilities, we should require significance at the level $\alpha/[k(k - 1)/2]$ instead of $\alpha$.

As an illustration we may consider the data of Example 1, § 9.1. The F-test
gives a non-significant difference between means, but let us nevertheless make a
t-test for the two samples 1 and 3, for which the estimated difference is 147.
As an independent estimate of the population variance, $\sigma^2$, we may use the
combined mean square within samples, which is 7619, with 14 d.f. The t-value
is $147\,[7619(\tfrac14 + \tfrac13)]^{-1/2} = 2.20$, and this is significant at the 5% level. The
probability of a value numerically as great is about 0.045. If, however, we
consider this difference as simply one out of 10, we should require a t-value
of 3.33 (corresponding to a probability 0.005) before claiming significance. This
would mean a difference of means of at least 222.

In general terms, for $k$ samples each of size $N$, Fisher's test requires us to
calculate a significant difference $D$ given by

(9.11.1)  $D = t_{\varepsilon,\nu}(2M_e/N)^{1/2}$

where $M_e$ is the mean square for error, with $\nu$ degrees of freedom,
$\varepsilon = \alpha/[k(k - 1)/2]$, and $t_{\varepsilon,\nu}$ is the upper $\varepsilon/2$ point of the t-distribution for $\nu$
degrees of freedom. Then $D$ is the difference which should be regarded as
significant at level $\alpha$. The $100(1 - \alpha)\%$ confidence limits for the difference
between two population means are $\bar{x}_1 - \bar{x}_2 \pm D$, where $\bar{x}_1$ and $\bar{x}_2$ are the
observed sample means. If only differences as great as $D$ are reported as
significant, the expected number of wrong statements per experiment will be $\alpha$.
Thus for $\alpha = 0.05$ we should make only one wrong statement in 20 experiments
of the same type.
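
The arithmetic of the illustration from Example 1 can be retraced as follows. This is a sketch under stated assumptions: the two sample sizes are taken to be 4 and 3, which reproduces the quoted t-value, and the critical value 3.33 is copied from the text rather than computed. All variable names are illustrative.

```python
from math import sqrt

difference = 147.0            # estimated difference between samples 1 and 3
mean_square_within = 7619.0   # combined mean square within samples, 14 d.f.
n_first, n_second = 4, 3      # assumed sizes of the two samples
k = 5                         # number of samples in Example 1
alpha = 0.05

std_error = sqrt(mean_square_within * (1 / n_first + 1 / n_second))
t_value = difference / std_error

# Fisher's suggestion: divide the level among the k(k - 1)/2 possible pairs.
n_pairs = k * (k - 1) // 2
per_comparison_level = alpha / n_pairs

t_required = 3.33             # two-sided 0.005 point for 14 d.f., as quoted
required_difference = t_required * std_error
```

The computed `t_value` agrees with the quoted 2.20 to within rounding, and `required_difference` comes to about 222.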

If the F-test has shown significance and one desires to know which means
differ significantly from which others, the Newman-Keuls procedure is perhaps
the best. Let $a$ be the number of levels of the factor under consideration (say
treatments), and let $M_e$ be the mean square used as the denominator of $F$ in
testing for this factor and $\nu$ the number of degrees of freedom corresponding to
$M_e$. Also let $r$ be the number of observations at each level of the factor. The
observed means are arranged in order from the smallest to the greatest, and the
test is carried out on sub-groups of $p$ successive means beginning with the
whole set (for which $p = a$). For any such group the significance is tested by
comparing the observed range with the critical range, the latter being obtained
from the distribution of the studentized range.

The studentized range of $p$ observations, having an actual range $R$ and
coming from a population with variance $\sigma^2$, is $R/s_\nu$ where $s_\nu^2$ is an independent
estimate of $\sigma^2$ based on $\nu$ degrees of freedom. (Note that the standardized range
is $R/\sigma$.) In the analysis of variance the estimate is usually based on the mean
square for error.

The probability integral (distribution function) of the studentized range is
given by

(9.11.2)  $F(q) = P(R/s_\nu < q) = \frac{\nu^{\nu/2}}{\Gamma(\nu/2)\,2^{(\nu-2)/2}} \int_0^\infty x^{\nu-1} e^{-\nu x^2/2}\, G(qx)\, dx$

where $G(w)$ is the probability integral for the standardized range, § 8.21. Tables
of percentage points of the studentized range may be found in [11], for the
appropriate values of $p$ and $\nu$. If $q$ is the upper 5% point, the critical range is
found by multiplying $q$ by $s_\nu$. In the Newman-Keuls procedure the observations
are each a mean of $r$ original observations, and the estimated $s_\nu^2$ is therefore
$M_e/r$.

Example 3  Table 9.10 gives measurements of tensile strength (kg/cm²)
on specimens of rubber. There were five batches, each batch affording six
specimens. The mean, range and variance for each batch are given.

Table 9.10

Batch Number       1        2        3        4        5

                 177      116      170      181      177
                 172      179      156      190      186
                 137      182      188      210      199
                 196      143      212      173      202
                 145      156      164      172      204
                 168      174      184      187      198

Mean           165.8    158.3    179.0    185.5    194.3
Range             59       66       56       38       27
Variance       468.6    653.1    406.0    196.3    111.5

The within-batch estimate of variance is $M_e = (1835.5)/5 = 367.1$, with
25 d.f., and $r = 6$, so that $s_\nu = 7.82$. The value of $q$ for $\nu = 25$ and $p = 5$ is
4.16, giving a critical range of 32.5.

Arranging  the  actual  means  in  order,  we  get 

158.3,     165.8,     179.0,     185.5,     194.3

For  the  whole  set,  R  =  36.0,  which  exceeds  the  critical  range,  and  we  may 
therefore  conclude  that  there  is  a  significant  difference  at  the  5  %  level  for  the 
whole  group  of  means.  This  test  can  therefore  be  regarded  as  a  substitute  for 
the  analysis  of  variance  test.  In  the  present  example  the  mean  square  between 
batches  is  1273.5  with  4  d.f.,  and  F  =  3.47.  Since  the  5%  point  of  F,  with  4 
and  25  d.f.,  is  2.76  and  the  1  %  point  is  4.18,  the  observed  value  is  significant 
at  the  5  %  level,  which  agrees  with  the  result  of  the  Newman-Keuls  test. 

We  next  proceed  to  omit  the  largest  (or  smallest)  mean  and  find  the  critical 
range  for  both  the  remaining  sets  of  4  means.  If  a  significant  difference  is  found 
we  test  groups  of  3  means,  and  so  on.  Any  time  that  a  group  of  means  is  found 
not  to  differ  significantly  it  is  underlined.  Any  two  means  not  underlined  by 
the  same  line  differ  significantly. 

The $q$ values for $p = 2, 3, 4$, with $\nu = 25$, are as follows (upper 5% points),
giving the critical ranges $c$ shown:

p       2       3       4
q       2.92    3.52    3.89
c       22.8    27.5    30.4

Omitting 194.3, the range of the means is 27.2, and omitting 158.3, it is 28.5,
neither of which is significant. It is not therefore necessary to go any further.
The set of means, underlined, is as shown below:

158.3     165.8     179.0     185.5     194.3
-----------------------------------
          -----------------------------------


Only  the  contrast  of  the  first  and  last  is  significant  at  the  5  %  level. 
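
The steps of Example 3 can be retraced with a short program. This is a sketch of the procedure as described here, not a general-purpose routine: the $q$ values are simply copied from the text (upper 5% points for $\nu = 25$), and the function name `newman_keuls` and its return format are assumptions.

```python
from math import sqrt
from statistics import mean, variance

# Table 9.10: tensile strength for five batches of six specimens.
BATCHES = {
    1: [177, 172, 137, 196, 145, 168],
    2: [116, 179, 182, 143, 156, 174],
    3: [170, 156, 188, 212, 164, 184],
    4: [181, 190, 210, 173, 172, 187],
    5: [177, 186, 199, 202, 204, 198],
}
# Upper 5% points of the studentized range for nu = 25, as quoted in the text.
Q_05 = {2: 2.92, 3: 3.52, 4: 3.89, 5: 4.16}


def newman_keuls(batches, q_table):
    r = len(next(iter(batches.values())))
    ordered = sorted((mean(v), name) for name, v in batches.items())
    m_e = mean(variance(v) for v in batches.values())  # within-batch mean square
    s = sqrt(m_e / r)                                  # standard error of a mean
    a = len(ordered)
    homogeneous = []     # index ranges (lo, hi) found not to differ
    significant = []     # pairs of batch labels at the ends of significant ranges
    for p in range(a, 1, -1):
        for lo in range(a - p + 1):
            hi = lo + p - 1
            # A group lying inside a non-significant group is not tested.
            if any(l <= lo and hi <= h for l, h in homogeneous):
                continue
            if ordered[hi][0] - ordered[lo][0] > q_table[p] * s:
                significant.append((ordered[lo][1], ordered[hi][1]))
            else:
                homogeneous.append((lo, hi))
    return m_e, s, significant, homogeneous


m_e, s, significant, homogeneous = newman_keuls(BATCHES, Q_05)
```

The two index ranges returned in `homogeneous` correspond to the two underlines, and `significant` holds only the pair formed by the batches with the smallest and largest means.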

Applying  the  same  method  to  the  data  of  Example  1  (and  assuming  for 
convenience  that  the  samples  are  all  of  the  same  size  4),  we  find  that  the  critical 
range  for  p  =  5  is  192  while  the  actual  range  is  147.  There  is  therefore  no 
significant  difference  between  the  means,  as  was  found  previously  by  the  F-test, 
and  the  contrast  of  even  the  least  and  greatest  is  non-significant  at  the  5  %  level. 

*  9.12  The Power of Analysis of Variance Tests  In calculating the power of
the F-test, it is convenient to make the change of variable

(9.12.1)  $x = \frac{n_1 F}{n_1 F + n_2}$

where $n_1$ and $n_2$ are the degrees of freedom for $F$. Under the null hypothesis $H_0$,
$x$ is an ordinary beta-variate with parameters $\tfrac12 n_1$, $\tfrac12 n_2$. The distribution under
the alternative hypothesis $H_1$ was worked out by Tang [12].

If we consider the one-way layout, with $k$ samples each of size $r$, and if
$S_b$, $S_w$ are the sums of squares between treatments and within samples
respectively, we have

(9.12.2)  $x = \frac{S_b}{S_b + S_w}, \qquad n_1 = k - 1, \qquad n_2 = k(r - 1)$

With the usual notation for fixed treatment effects $\alpha_i$ (with $\sum_i \alpha_i = 0$), the
null hypothesis is that $\sigma_\alpha^2 = 0$, where

(9.12.3)  $\sigma_\alpha^2 = \frac{\sum_i \alpha_i^2}{k - 1}$

The alternative hypothesis $H_1$ is that not all the $\alpha_i$ are 0, so that $\sigma_\alpha^2$ is not 0.
The population variance is assumed constant and equal to $\sigma^2$.

Under $H_0$, $S_b/\sigma^2$ has the $\chi^2$ distribution with $n_1$ d.f. Under $H_1$ it has the
non-central $\chi^2$ distribution (§ A.13) with non-centrality parameter

(9.12.4)  $\lambda = \frac{r(k - 1)\sigma_\alpha^2}{2\sigma^2}$

On the other hand, $S_w/\sigma^2$ still has the ordinary $\chi^2$ distribution with $n_2$ d.f.
The density function for $x$ under $H_1$ is

(9.12.5)  $f(x) = \frac{e^{-\lambda}\, x^{n_1/2 - 1}(1 - x)^{n_2/2 - 1}\, H(\lambda x)}{B(\tfrac12 n_1, \tfrac12 n_2)}$

where

(9.12.6)  $H(z) = 1 + \frac{n}{n_1}\frac{z}{1!} + \frac{n(n + 2)}{n_1(n_1 + 2)}\frac{z^2}{2!} + \cdots$

with $n = n_1 + n_2$. This function is called the confluent hypergeometric function.
When $\lambda = 0$, $H(0) = 1$ and $x$ is a beta-variate.

The null hypothesis is rejected when $x > x_\alpha$, $x_\alpha$ being determined for a
given Type I error $\alpha$ by the relation

(9.12.7)  $\alpha = \int_{x_\alpha}^{1} f(x \mid \lambda = 0)\, dx = 1 - I_{x_\alpha}(\tfrac12 n_1, \tfrac12 n_2)$

where $I$ is the incomplete beta-function ratio tabulated by K. Pearson (§ 4.5).
The power of the test is given by

(9.12.8)  $P = 1 - \beta = \int_{x_\alpha}^{1} f(x)\, dx$

where $f(x)$ is the function defined in Eq. (5). Tang's tables give values of $x_\alpha$
corresponding to $\alpha = 0.05$ and 0.01 and also the power $P$ for selected values
of $\lambda$. (Tang denotes our $x$ by $E^2$, and our $\lambda$ by $k\phi^2/2$.)

Example 4  Suppose we have four treatments, each with five replicates.
Then $n_1 = 3$, $n_2 = 16$. (If the experiment were done in randomized complete
blocks, we should have to subtract the four degrees of freedom between blocks
and so obtain $n_2 = 12$.) Suppose the true treatment effects $\alpha_i$, expressed as
percentages of $\mu$, are $-5$, $-4$, 3, 6, and let an estimate of $\sigma$ be 10% of $\mu$ (obtained
perhaps from previous experience). Then $\sigma_\alpha^2 = 86/3 = 28.7$, $\sigma^2 = 100$,
$\lambda = 2.15$, $\phi = 1.04$, $n_1 = 3$, $n_2 = 16$. The tables show that for $\alpha = 0.05$, the
power $P$ when $\phi = 1.04$ is about 0.31, so that the chance of detecting differences
as great as those just mentioned is only about 3 out of 10.
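
In place of Tang's tables, the power in Example 4 can be approximated by integrating the density (9.12.5) numerically. The sketch below is one possible way of doing this and is not the method of the text: it truncates the series (9.12.6), uses Simpson's rule for the integrals and bisection for $x_\alpha$, and assumes $n_1 \ge 2$ and $n_2 \ge 2$ so that the integrand is bounded. All function names are illustrative.

```python
from math import exp, gamma, sqrt


def H(z, n1, n2, terms=40):
    """Series (9.12.6), truncated after `terms` terms beyond the leading 1."""
    n = n1 + n2
    total = term = 1.0
    for j in range(terms):
        term *= (n + 2 * j) / (n1 + 2 * j) * z / (j + 1)
        total += term
    return total


def density(x, n1, n2, lam):
    """Density (9.12.5) of x = n1*F / (n1*F + n2) with non-centrality lam."""
    beta = gamma(n1 / 2) * gamma(n2 / 2) / gamma((n1 + n2) / 2)
    return (exp(-lam) * x ** (n1 / 2 - 1) * (1 - x) ** (n2 / 2 - 1)
            * H(lam * x, n1, n2) / beta)


def upper_tail(c, n1, n2, lam, steps=400):
    """Simpson's rule for the integral of the density from c to 1."""
    h = (1 - c) / steps
    total = density(c, n1, n2, lam) + density(1.0, n1, n2, lam)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(c + i * h, n1, n2, lam)
    return total * h / 3


def critical_x(alpha, n1, n2):
    """Solve (9.12.7) for x_alpha by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if upper_tail(mid, n1, n2, 0.0) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Example 4: four treatments, five replicates, effects in percent of the mean.
effects = [-5, -4, 3, 6]
k, r, sigma = len(effects), 5, 10.0
n1, n2 = k - 1, k * (r - 1)
sigma_alpha_sq = sum(e * e for e in effects) / (k - 1)
lam = r * (k - 1) * sigma_alpha_sq / (2 * sigma ** 2)   # Eq. (9.12.4)
phi = sqrt(2 * lam / k)                                 # Tang's parameter

x_alpha = critical_x(0.05, n1, n2)
power = upper_tail(x_alpha, n1, n2, lam)                # Eq. (9.12.8)
```

The value of `power` obtained in this way should lie close to the figure of about 0.31 read from the tables.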

The same method will apply to a Latin square experiment. If the square is
of side $m$, the degrees of freedom are $n_1 = m - 1$, $n_2 = (m - 1)(m - 2)$.

If we use Model II in which the treatment effects are random with variance
$\sigma_\alpha^2$, the power of the test is a function of $1 + \sigma_\alpha^2/\sigma^2$ (see § 9.5).

Recent investigations on the effect of non-normality suggest that moderate
amounts of skewness and kurtosis have comparatively little effect on the power,
kurtosis being more important than skewness in this respect.

PROBLEMS
A.  (§§  9.1-9.2) 

1.  Five  samples,  each  of  four  seasoned  mine-props,  were  tested  for  maximum 
load.  The  means  and  standard  deviations  of  the  maximum  load  (in  units  of  1000  lb  wt) 
were  as  follows : 


Sample No.    $\bar{x}_i$    $s_i = (S_i/n_i)^{1/2}$
    1            42.0           10.10
    2            52.0           12.06
    3            65.5            5.45
    4            51.8            9.54
    5            73.5           19.26

Test  approximately  the  homogeneity  of  the  variances,  using  Bartlett's  method. 
2.  Prove that when $k = 2$, the sum of squares between samples reduces to

$S_b = \frac{N_1 N_2}{N_1 + N_2}(\bar{x}_1 - \bar{x}_2)^2$

3.  Show that the t-test for the significance of a difference between two sample
means is a special case (when $k = 2$) of the F-test given in § 9.2 for $k$ samples. Hint:
$S_b$ reduces to the expression given in Problem 2, and $S_w$ becomes $N_1 s_1^2 + N_2 s_2^2$. Then
$F = (N - 2)S_b/S_w$ and this is the square of $t$ in § 8.8, when $\mu_1 = \mu_2$.

4.  The  following  table  represents  the  yield  of  wheat,  in  bushels  per  acre,  for  trial 
plots  of  land  treated  with  four  different  levels  of  fertilizer.  Each  level  was  applied  to 
five  plots  randomly  chosen  over  a  field. 


Plot                 Treatments
Number        1      2      3      4
  1          21     24     34     40
  2          25     33     26     47
  3          31     34     38     39
  4          17     39     32     41
  5          26     35     35     33

Determine  whether  there  is  a  significant  treatment  effect. 

5.  Seed  yields  in  hundreds  of  pounds  per  acre  from  three  replicates  of  four  varieties 
of  alfalfa  were  as  follows : 


                    Varieties
Plots         1      2      3      4
  1          4.7    5.1    9.1    4.9
  2          5.2    4.8    6.3    5.2
  3          2.4    3.2    5.8    5.3

Does there appear to be a significant difference between varieties?

B.  (§§ 9.3-9.5)

1.  The  data  represent  sugar  yields  (tons/acre)  for  nine  varieties  of  sugar  beet. 
The  design  consists  of  five  blocks  each  of  nine  plots,  with  no  replications.  The  varieties 
were  randomized  among  the  plots  in  each  block.  Test  for  significant  differences  between 
varieties,  and  between  blocks.  Hint:  The  interaction,  if  any,  is  included  with  the  error. 


                                    Variety
Block      A      B      C      D      E      F      G      H      J
  1      1.94   1.70   2.23   2.14   1.80   1.82   1.91   1.90   1.98
  2      2.08   1.96   2.26   2.08   2.23   2.06   2.06   2.25   2.03
  3      1.86   1.83   2.22   2.16   1.67   2.03   2.22   1.92   1.81
  4      2.21   1.60   2.08   2.16   2.11   1.96   2.14   1.99   1.77
  5      2.03   2.13   2.02   2.17   2.01   2.28   2.28   2.02   1.88

2.  In  a  greenhouse  experiment  on  wheat,  four  fertilizer  treatments  of  the  soil 
and  four  chemical  treatments  of  the  seed  were  used  (including  in  each  case  a  control 
with  no  treatment).  Each  combination  was  applied  to  three  plots  which  were  placed 
at  random  in  the  available  space.  Show  that  there  is  negligible  interaction  between 
chemical  treatments  and  fertilizers,  but  a  large  effect  due  to  fertilizers. 


                                      Chemical Treatment
Fertilizer          1                   2                   3                   4

    1       21.4, 21.2, 20.1    20.9, 20.3, 19.8    19.6, 18.8, 16.6    17.6, 16.6, 17.5
    2       12.0, 14.2, 12.1    13.6, 13.3, 11.6    13.0, 13.7, 12.0    13.3, 14.0, 13.9
    3       13.5, 11.9, 13.4    14.0, 15.6, 13.8    12.7, 12.9, 13.1    12.4, 13.7, 13.0
    4       12.8, 13.8, 13.7    14.1, 13.2, 15.3    14.2, 13.6, 13.3    12.0, 14.6, 14.0

3.  In  an  experiment  to  determine  whether  five  makes  of  automobile  average  the 
same  number  of  miles  per  gallon,  three  cars  of  each  make  were  selected  at  random  in 
each  of  three  cities  and  given  a  test  run  on  one  gallon  of  a  standard  gasoline.  The 
table  gives  the  number  of  miles  travelled.  Make  an  analysis  of  variance  and  determine 
whether  there  is  a  significant  effect  (a)  of  makes,  (b)  of  cities. 


Make      Los Angeles          San Francisco        Portland

 A        20.3, 19.8, 21.4     21.6, 22.4, 21.3     19.8, 18.6, 21.0
 B        19.5, 18.6, 18.9     20.1, 19.9, 20.5     19.6, 18.3, 19.8
 C        22.1, 23.0, 22.4     20.1, 21.0, 19.8     22.3, 22.0, 21.6
 D        17.6, 18.3, 18.2     19.5, 19.2, 20.3     19.4, 18.5, 19.1
 E        23.6, 24.5, 25.1     17.6, 18.3, 18.1     22.1, 24.3, 23.8

4.  Estimate  the  separate  treatment  effects  in  Problem  A-4  and  the  separate  variety 
effects  in  Problem  A-5,  using  Model  I. 

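5.  Estimate the variety effects and block effects in Problem B-1, using Model I.

7.  Prove the following identities:

(a)  $\sum_{ij} (\bar{x}_{i..} - \bar{x})(\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x}) = 0$

(b)  $\sum_{ij} (\bar{x}_{.j.} - \bar{x})(\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x}) = 0$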

8.  In Problem A-4, assume that the treatments are random selections from a normal
population  (e.g.,  samples  of  fertilizer  might  differ  in  the  proportions  of  an  active 
ingredient,  and  these  samples  might  be  chosen  at  random  and  applied  in  equal  amounts 
in  each  treatment).  Calculate  the  components  of  variance,  the  intra-class  correlation, 
and  the  power  of  the  F-test  corresponding  to  a  size  of  0.05. 

9.  In  an  experiment  on  yield  of  sugar  beets  (tons/acre)  there  were  two  levels  of 
irrigation  treatment  and  three  of  fertilizer  treatment,  and  each  combination  of  treat- 
ments was  carried  out  in  five  replications.  The  analysis  of  variance  table  was  as  follows : 


Variation        S.S.      D.F.      M.S.

Irrigation      120.0        1      120.0
Fertilizer      221.7        2      110.9
Interaction      35.0        2       17.5
Error           108.0       24        4.5

Total           484.7       29

Assuming  that  it  makes  sense  to  regard  the  irrigation  and  fertilizer  effects  as  random 
(Model  II),  estimate  the  components  of  variance. 

C.  (§§  9.6-9.8) 

1.  Regard  Problem  B-3  as  one  of  mixed  type  (Model  III),  with  the  makes  of  cars 
fixed  and  the  cities  random.  (That  is,  the  results  will  apply  only  to  the  particular  makes 
selected  for  the  experiment,  but  there  may  be  a  wide  variety  of  possible  cities  chosen 
for  the  trials.)  Estimate  the  components  of  variance,  and  show  that  because  of  the 
high  interaction  there  is  not,  on  this  model,  a  significant  effect  as  between  makes  of 
cars. 

2.  Show  that  the  sums  of  squares  in  Eq.  (9.7.7)  may  be  written: 

Sa  =  r-1  X  (M-HL  xw)2  -  Nx2 


jk 


Sb 


i  Jk 


f1  X  (Z  xiikY 

ij      k 
ijk  ij      k 

3.  The  following  data,  from  Scheffe  [7],  purport  to  represent  breaking  strengths 
of  tissues  from  different  boxes  of  the  same  brand,  purchased  in  three  different  cities. 
From  each  box  six  tissues  were  measured.  Calculate  the  mean  squares  for  cities,  for 
boxes  within  cities,  and  for  tissues  within  boxes,  and  test  for  the  reality  of  a  city  effect 
and  an  effect  of  boxes  within  cities. 


City          1                   2                         3
Boxes      1      2        1      2      3        1      2      3      4

          1.59   1.72     2.44   2.27   2.46     1.36   1.59   1.73   1.53
          1.80   1.40     2.11   2.70   2.21     1.43   1.50   1.74   1.41
          1.72   2.02     2.41   2.36   2.50     1.48   1.50   1.65   1.64
          1.69   1.75     2.48   2.36   2.37     1.55   1.49   1.58   1.51
          1.71   1.95     2.36   2.16   2.24     1.53   1.47   1.49   1.52
          1.83   1.61     2.36   2.04   2.25     1.39   1.63   1.70   1.36

The boxes may be supposed chosen at random from a large number in each city.
Treat  the  cities  as  random  also  and  estimate  the  components  of  variance.  Hint:  Use 
the  computation  formulas  of  Problem  2. 

4.  In a Latin square layout, let $T_c$ be the sum of squares of the column totals, $T_r$
the S.S. of the row totals and $T_t$ the S.S. of the treatment totals. Prove that $S_a$, $S_b$, $S_c$
in Eq. (9.8.4) may be written: $S_a = T_c/m - G$, $S_b = T_r/m - G$, $S_c = T_t/m - G$.

5.  The  following  table  gives  wheat  yields  (bushels/acre)  for  five  fertilizer-treatments 
of  plots  arranged  in  a  Latin  square.  Test  for  significance  of  row,  column  and  treatment 
effects,  using  Model  I. 


                          Columns
Rows        1        2        3        4        5

  1       34(C)    21(A)    52(E)    24(B)    40(D)
  2       33(B)    45(E)    47(D)    26(C)    25(A)
  3       31(A)    38(C)    34(B)    39(D)    38(E)
  4       44(E)    41(D)    32(C)    17(A)    39(B)
  5       33(D)    35(B)    26(A)    46(E)    35(C)

Hint:  Use  the  computation  method  suggested  in  Problem  4. 

D.  (§§9.9-9.12) 

1.  Five  detergents,  lettered  A  to  E,  were  compared  as  to  number  of  soiled  dinner 
plates  washed  in  a  basin  before  the  foam  disappeared  from  the  basin.  In  each  block 
there  were  three  basins  containing  three  different  detergents,  and  the  three  dish-washers 
rotated  after  washing  each  plate.  Analyze  this  balanced  incomplete  block  experiment. 
(Data  from  Scheffe  [7].)  Test  for  differences  between  blocks  as  well  as  between 
detergents. 


Block 

Detergent 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

A 

27 

28 

30 

31 

29 

30 

B 

26 

26 

29 

30 

21 

26 

C 

30 

34 

32 

34 

31 

33 

D 

29 

33 

34 

31 

33 

31 

E 

26 

24 

25 

23 

24 

26 

2.  In Problem B-1, calculate a significant difference between pairs of varieties by
Fisher's method, using 0.05 as the value of α. Which pairs appear to be significantly
different at this level? Hint: To find the α/2 point for t, use one of the approximations
in § 8.6.

3.  Apply the Newman-Keuls procedure to the data of Problem B-1, using the 5%
level of significance. Find which pairs are significantly different at this level. Hint:
For ν = 32, the upper 5% values of q for p = 9, 8, 7 and 6 are 4.70, 4.58, 4.45 and 4.29
respectively.

4.  A group of eight different kinds of alloy steels is to be tested for tensile strength.
It is expected that the strengths will be of the order of 150,000 lb/in² and that the standard
deviation for any one kind will be of the order of 3000 lb/in². How many specimens
should be used for each kind of alloy if we want the error of the first kind not to exceed
5% and if we would like the probability of rejecting the null hypothesis to be at least
0.9 if in fact two of the alloys differ by 10,000 lb/in² or more? (Scheffe [7]). Hint: The
minimum value of $\sigma_\alpha^2$ satisfying the condition that two $\alpha_i$ differ by 10,000 is given by
putting $\alpha_1 = -5{,}000$, $\alpha_2 = 5{,}000$, and all the other $\alpha_i = 0$. This makes $\sigma_\alpha^2 >$
(50,000)/7. By means of charts prepared by Pearson and Hartley [13], suitable values of
r and φ can be read off. Tang's tables are not so convenient for this purpose.

Chapter  10

NON-PARAMETRIC  STATISTICAL  TESTS 

10.1  Non-Parametric  or  Distribution-Free  Tests  Many  common  statistical 
tests  are  concerned  with  estimating  parameters  in  a  distribution  function  of 
known  or  assumed  form  (for  the  population),  or  with  testing  the  significance  of 
differences  between  samples  on  the  hypothesis  that  they  come  from  such  a 
population. Thus the population may be supposed to be normal, with parameters
$\mu$ and $\sigma^2$, and these may be estimated by the sample statistics $m$ and $s^2$, or
we may use an F-test to decide whether two samples which differ in variance
may reasonably be presumed to come from normal populations with the same
variance $\sigma^2$. Such tests, since they deal with population parameters, are called
parametric. 

There  are,  however,  other  tests  which  do  not  require  assumptions  about  the 
parameters  of  the  population  from  which  the  sample  is  drawn.  These  are  called 
non-parametric,  or  distribution-free,  since  they  are  free  of  specific  assumptions 
about  the  distribution  in  the  parent  population.  Some  assumptions,  of  course, 
have  to  be  made — for  example,  that  the  observations  constituting  the  sample 
are  independent — but  these  assumptions  are  considerably  weaker  than  those 
required  for  the  usual  parametric  tests.  The  most  obvious  danger  in  using  such 
tests as the t-test and the F-test is that the underlying assumption of normality
may  not  be  justified.  It  is  true  that  these  tests  appear  to  be  fairly  insensitive  to 
considerable  departures  from  normality  (they  are  said  to  be  robust  tests)  but 
nevertheless  for  some  kinds  of  data  it  would  be  rash  to  assume  anything  like  a 
normal  distribution  and  in  such  cases  non-parametric  tests  should  be  used. 

Non-parametric  tests  are  generally  simple  to  apply,  not  involving  much 
computation.  In  those  cases  where  a  parametric  test  would  also  be  applicable,  a 
non-parametric  test  will  naturally  be  less  powerful  than  the  parametric  one. 
However,  if  it  is  fairly  easy  to  obtain  new  observations,  the  lack  of  power  may 
be  compensated  by  increase  of  the  sample  size,  and  the  non-parametric  test 
appeals  just  because  of  its  simplicity.  In  this  chapter  some  of  the  commoner 
tests  will  be  described.  These  tests  are  particularly  interesting  to  students  of  the 
behavioral  sciences  because  so  much  of  the  data  in  psychology,  education,  etc. 
is  of  a  kind  that  can  be  classified  or  ranked,  but  not  accurately  measured.  Good 
general  surveys  of  non-parametric  methods  may  be  found  in  references  [1],  [2], 
[3]  and  [17]. 

10.2  The Chi-Square Test of Hypotheses  Suppose that the members of a
population can all be placed in one or other of a set of $k$ categories. These may
be nominal like "male" or "female," or may be intervals of the domain of some
measured variable like height. Also suppose that according to a certain hypothesis
$H_0$ the probabilities of falling in these classes should be $\pi_1, \pi_2, \ldots, \pi_k$. This
hypothesis may be tested by observing the actual frequencies $f_1, f_2, \ldots, f_k$ in a
sample of $N$ items, corresponding to the respective classes. The distribution of
the $N$ sample items among the $k$ classes is multinomial (Appendix A.16), and
the number of degrees of freedom is $k - 1$, since the $k$ frequencies are connected
by the linear relation $\sum f_i = N$. As shown in Appendix A.17, the
quantity

$$\chi_s^2 = \sum_{i=1}^{k} \frac{(f_i - N\pi_i)^2}{N\pi_i}$$

has in the limit the $\chi^2$ distribution with $k - 1$ d.f. It is assumed that the
quantities $\pi_i$ are given by the hypothesis $H_0$ and are not estimated from the
sample. If they have to be estimated, the degrees of freedom are reduced
(see § 10.3).

The quantity $N\pi_i$ is the expected frequency in the $i$th class for a sample of
size $N$, on hypothesis $H_0$, and may be denoted by $\phi_i$. The proof that $\chi_s^2$ is
approximately distributed like $\chi^2$ uses Stirling's approximation (Appendix A.2)
for $\phi_i!$ and therefore the $\phi_i$ should not be too small. It has been customary to
require that all the $\phi_i$ should be at least 5, but some studies [4] suggest that
values as low as 1 may sometimes be tolerated without causing serious error. In
practice, the classes with low expected frequency usually come near the ends of
the distribution, and it is common practice to combine or "pool" the end-classes
until the $\phi_i$ reach a satisfactory size. The objection to pooling is that some
important differences between $f_i$ and $\phi_i$ in the end-classes may be hidden by this
treatment. Since $\chi_s^2$ depends on the square of $f_i - \phi_i$, the sign of this difference
is ignored, and yet it might well be of significance for $H_0$ if the sign were constant
over several classes near the end of the table. For this reason, we recommend
that pooling should be done cautiously, only when any expected frequency
would otherwise fall close to or below 1. Too much pooling reduces the chance
of rejecting $H_0$ if it really should be rejected.

Example  1  The  Abbe  Mendel,  in  a  now  classic  experiment  on  heredity, 
observed  the  shape  and  color  of  peas  from  a  number  of  plants  in  the  first- 
generation  progeny  of  a  cross.  He  found  that  they  could  be  put  into  four 
groups,  as  follows : 

Round  and  yellow  315 

Round  and  green  108 

Wrinkled  and  yellow  101 

Wrinkled  and  green  32 

According to his theory of heredity, these frequencies should be in the ratio
9:3:3:1. The expected frequencies $\phi_i$ for a total of 556 should therefore be as
shown in Table 10.1.

Table  10.1

  $f_i$      $\phi_i$     $f_i - \phi_i$    $(f_i - \phi_i)^2$    $(f_i - \phi_i)^2/\phi_i$

  315      312.75        2.25            5.06               0.016
  108      104.25        3.75           14.06               0.135
  101      104.25       -3.25           10.56               0.101
   32       34.75       -2.75            7.56               0.218

  556      556           0                                  0.470

From the data, $\chi_s^2 = 0.470$, with 3 d.f. (since here, $k = 4$). The probability
of a value of $\chi^2$ as great as this is about 0.92, so that the agreement of theory
and  experiment  is  very  good.  Considerably  larger  disagreement  might  be 
expected,  even  when  H0  is  true. 
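
The arithmetic of Example 1 can be checked with a short script. This is only a sketch: the helper name `chi2_sf` is hypothetical, and it evaluates the upper-tail probability of $\chi^2$ from the standard closed-form series for integer degrees of freedom rather than from the printed tables the text relies on.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with integer df exceeds x)."""
    h = x / 2.0
    if df % 2 == 0:
        # even df = 2m: exp(-h) * sum_{j=0}^{m-1} h^j / j!
        return math.exp(-h) * sum(h**j / math.factorial(j) for j in range(df // 2))
    # odd df = 2m+1: erfc(sqrt(h)) + exp(-h) * sum_{j=1}^{m} h^(j-1/2) / Gamma(j+1/2)
    m = (df - 1) // 2
    series = sum(h**(j - 0.5) / math.gamma(j + 0.5) for j in range(1, m + 1))
    return math.erfc(math.sqrt(h)) + math.exp(-h) * series

observed = [315, 108, 101, 32]      # Mendel's four classes
ratio = [9, 3, 3, 1]                # ratio predicted by the hypothesis
N = sum(observed)
expected = [N * r / sum(ratio) for r in ratio]

chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
df = len(observed) - 1              # no parameters estimated from the sample
p = chi2_sf(chi2, df)

print(f"chi-square = {chi2:.3f}, d.f. = {df}, P = {p:.3f}")
```

The expected frequencies come out as in Table 10.1, and the tail probability falls between 0.92 and 0.93, in line with the value quoted above.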

Very occasionally, one encounters values of $\chi_s^2$ so low that the corresponding
probabilities are as high as 0.99. When these are not due to mistakes in calculation,
it may be suspected that the observations are not really random. The
hypothesis  H0  should  not,  of  course,  be  rejected  merely  because  of  an  agreement 
that  is  "too  good  to  be  true,"  but  this  kind  of  agreement  might  well  be  a  ground 
for  critical  reappraisal  of  the  data. 

10.3  The  Chi-Square  Test  of  Goodness  of  Fit  An  observed  frequency 
distribution  in  a  sample  may  often,  on  general  theoretical  grounds,  be  supposed 
to  arise  from  a  true  binomial,  Poisson,  normal,  or  some  other  known  type  of 
distribution  in  the  population.  This  hypothesis  may  be  tested  by  comparing  the 
observed  frequencies  in  various  classes  with  those  which  would  be  given  by  the 
assumed  theoretical  distribution.  Usually,  however,  the  parameters  of  this 
distribution  will  not  be  known  from  prior  considerations  but  will  have  to  be 
estimated from the sample. It may be shown (see for example [5]) that if $s$
parameters are estimated by the method of maximum likelihood, the limiting
distribution of $\chi_s^2$ is that of $\chi^2$ with $k - s - 1$ d.f. Each additional parameter
estimated from the sample introduces in effect a linear restriction on the variates
$z_i$, namely $(f_i - \phi_i)/\phi_i^{1/2}$, whose squares are added to produce $\chi_s^2$, and so
reduces the degrees of freedom by 1. The estimators used for the parameters need
not be the maximum likelihood ones, as long as they are asymptotically normal
and asymptotically most efficient (see § 5.6).

Example 2  Rutherford and Geiger (Phil. Mag. 20, 1910, p. 698) obtained
the following distribution of the number ($x$) of α-particles emitted from a disc
in 7.5 sec.

Assuming that the distribution is Poisson with parameter $\mu$, the maximum
likelihood estimator of $\mu$ is the arithmetic mean $\bar{x}$, which is 3.870. The calculated
frequency $\phi$ for any given $x$ is $N e^{-\mu}\mu^x/x!$ with $\mu = 3.870$ and $N = 2608$. The
last two values of $\phi$ in Table 10.2 have been pooled to give a total greater than 1.
For 13 classes, with one parameter estimated from the sample, the number of
degrees of freedom is 11 and the probability $P$ of a value of $\chi^2$ as great as 12.99
is about 0.30. The hypothesis that the distribution in the population is Poisson is
therefore not rejected by this test.

Table  10.2

      $x$              $f$         $\phi$       $(f - \phi)^2/\phi$

       0               57        54.40        0.124
       1              203       210.52        0.269
       2              383       407.36        1.457
       3              525       525.50        0.001
       4              532       508.42        1.094
       5              408       393.52        0.533
       6              273       253.82        1.450
       7              139       140.32        0.012
       8               45        67.88        7.713
       9               27        29.19        0.164
      10               10        11.30        0.150
      11                4         3.97        0.000
  12 and ≥13 (pooled)   2         1.80        0.022

    Total            2608      2608.00       12.99
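
The columns of Table 10.2 can be regenerated as follows. This is a sketch under stated assumptions: the value $\mu = 3.870$ is taken from the text rather than recomputed, the pooled class is treated as "12 or more", and `chi2_sf` is a hypothetical helper using the closed-form series for the $\chi^2$ tail with integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with integer df exceeds x)."""
    h = x / 2.0
    if df % 2 == 0:
        return math.exp(-h) * sum(h**j / math.factorial(j) for j in range(df // 2))
    m = (df - 1) // 2
    series = sum(h**(j - 0.5) / math.gamma(j + 0.5) for j in range(1, m + 1))
    return math.erfc(math.sqrt(h)) + math.exp(-h) * series

# Observed frequencies for x = 0, 1, ..., 11, then the pooled class x >= 12.
observed = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 2]
N = sum(observed)
mu = 3.870                                   # estimate quoted in the text

# Poisson expected frequencies N * exp(-mu) * mu^x / x! for x = 0..11 ...
expected = [N * math.exp(-mu) * mu**x / math.factorial(x) for x in range(12)]
# ... and the pooled tail takes whatever is left, so the totals agree.
expected.append(N - sum(expected))

chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
s = 1                                        # one parameter (mu) estimated
df = len(observed) - s - 1
p = chi2_sf(chi2, df)

print(f"chi-square = {chi2:.2f}, d.f. = {df}, P = {p:.2f}")
```

Because the pooled class absorbs the remainder, the expected total is forced to equal $N$, which is the restriction that costs the last degree of freedom.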

*  10.4  The Power of the Chi-Square Test  The power function of the $\chi^2$ test
cannot be computed unless a specific alternative hypothesis $H_1$ is considered.
We  might  for  instance  suppose  in  Example  2  that  the  distribution,  instead  of 
being  Poisson,  is  really  binomial,  and  then  we  could  find  the  probability  of 
rejecting  H0  when  it  should  be  rejected.  Another  alternative  hypothesis  might 
be  that  the  observations  are  individually  Poisson,  but  with  means  that  vary  in 
some  systematic  way. 

For  very  large  samples,  the  chi-square  test  will  usually  reject  the  hypothesis 
that  the  distribution  follows  some  simple  assumed  law,  because  small  variations 
from  this  law  will  tend  to  show  up  in  so  large  a  sample.  With  small  samples  the 
test is not very sensitive. There is a parametric test of the Poisson distribution
which also depends on $\chi^2$, and which is more powerful than the ordinary
chi-square test. In a Poisson population the expectation and the variance are equal,
and in a sample of size $N$ from such a population the ratio $ns^2/m$ (where $m$
and $s^2$ are the sample mean and variance respectively and $n = N - 1$) is
distributed as $\chi^2$ with $n$ degrees of freedom. In Example 2 above, the ratio of
variance to mean is 0.95, and $ns^2/m$ is 2476. For such a large value, the normal
approximation to $\chi^2$ is adequate (§ 4.6), and we find that the probability of a
value as low as this with 2607 d.f. is about 0.033. The variance is therefore
significantly  lower  than  the  mean,  and  we  might  be  inclined  to  reject  the 
hypothesis  of  a  Poisson  distribution  on  account  of  this  test,  whereas  the 
ordinary  chi-square  test  would  lead  to  acceptance. 
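
The probability 0.033 can be reproduced with a few lines. The sketch below assumes that the normal approximation meant is Fisher's, in which $\sqrt{2\chi^2} - \sqrt{2n - 1}$ is treated as a standard normal variate; § 4.6 is not reproduced here, so that choice is an assumption, although it does give the quoted figure.

```python
import math
from statistics import NormalDist

n = 2607                 # degrees of freedom, N - 1
statistic = 2476.0       # n * s^2 / m, as quoted in the text

# Fisher's approximation (assumed): sqrt(2*chi2) - sqrt(2n - 1) ~ N(0, 1)
z = math.sqrt(2 * statistic) - math.sqrt(2 * n - 1)

# Lower-tail probability: chance of a value as low as the one observed
p_low = NormalDist().cdf(z)

print(f"z = {z:.2f}, P = {p_low:.3f}")
```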

10.5  The Chi-Square Test for a Grouped Distribution  If a measured variate
X,  which  may  be  supposed  to  have  a  continuous  distribution  in  the  population, 
is  grouped  into  classes  in  a  sample,  the  expected  frequencies  corresponding  to 
these  classes  may  be  calculated  if  we  are  prepared  to  make  some  assumptions 
about  the  population.  Generally,  we  assume  that  the  population  distribution 
follows  some  relatively  simple  and  plausible  law,  such  as  the  normal  law  or  one 
of  the  other  Pearson  types,  with  parameters  that  are  estimated  from  the  sample 
itself.  The  agreement  between  the  observed  and  calculated  frequencies  may  then 
be  tested  by  the  chi-square  test.  The  number  of  degrees  of  freedom  is  k  —  s  —  1, 
where  k  is  the  number  of  classes  in  the  sample  (after  pooling  if  necessary)  and  s 
is  the  number  of  parameters  estimated  from  the  sample.  (One  further  degree  of 
freedom  is  lost  because  of  the  forced  agreement  between  the  total  frequencies, 
calculated  and  observed.) 

Example 3  The data of Table 2.2 may perhaps arise from a normal distribution
of weights in the population sampled (eight-year-old Glasgow girls).
If they do so, unbiased estimates of the expectation $\mu$ and the variance $\sigma^2$ for this
distribution are provided by the statistics $k_1$ and $k_2$ which were calculated in
§ 5.10, namely, 47.71 lb and 33.34 lb². In order to find the expected frequencies we
need to obtain the standardized z-values corresponding to the class boundaries,
these values being given by

(10.5.1)  $z = \frac{x_e - \mu}{\sigma} = \frac{x_e - 47.71}{\sqrt{33.34}}$

For each $z$ a table of the normal law gives the probability of a value not
greater than $z$, i.e.,

(10.5.2)  $\Phi(z) = \int_{-\infty}^{z} \phi(u)\,du$

Table  10.3

Class boundary    Observed
($x_e$)           frequency ($f_i$)    $z_i$      $\Phi(z_i)$    $\phi_i = N\Delta\Phi(z_i)$

(−∞)                                   (−∞)       0.0000
31.5 lb               1               −2.808      0.0025           2.5
35.5                 14               −2.115      0.0172          14.7
39.5                 56               −1.422      0.0775          60.3
43.5                172               −0.729      0.2330         155.5
47.5                245               −0.036      0.4854         252.4
51.5                263                0.656      0.7441         258.7
55.5                156                1.349      0.9113         167.2
59.5                 67                2.042      0.9794          68.1
63.5                 23                2.734      0.9969          17.5
67.5                  3                3.427      0.9997           3.1
(∞)                                    (∞)        1.0000

Total              1000                                         1000.0

The differences between successive values of $\Phi(z)$, denoted in Table 10.3
by $\Delta\Phi(z)$, are the probabilities, for a random item, of falling in the
corresponding classes, so that the expected frequencies are given by

(10.5.3)  $\phi_i = N\,\Delta\Phi(z_i)$

These  are  calculated  in  Table  10.3. 
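
Equations (10.5.1) to (10.5.3) translate directly into code. The sketch below uses the normal distribution function from the Python standard library in place of a printed table, so the results may differ from the tabulated ones in the last figure; the variable names are illustrative only.

```python
from statistics import NormalDist

N = 1000
mean, variance = 47.71, 33.34          # k1 and k2 from the text (lb, lb^2)
dist = NormalDist(mean, variance ** 0.5)

# Interior class boundaries; the first and last classes are open-ended.
boundaries = [31.5, 35.5, 39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]
observed = [1, 14, 56, 172, 245, 263, 156, 67, 23, 3]

# Phi at each boundary, with 0 and 1 standing for z = -infinity and +infinity
cdf = [0.0] + [dist.cdf(b) for b in boundaries] + [1.0]

# Expected frequency of each class: N times the difference of successive Phi
expected = [N * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

for f, e in zip(observed, expected):
    print(f"{f:5d} {e:8.1f}")
```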

The first class includes all values of $z$ from −∞ up to −2.808; the last class
includes all values from 2.734 up to ∞ (although actually no values were
observed beyond 3.427). From the columns for $f_i$ and $\phi_i$ we obtain

$\chi_s^2 = 6.78$ with 10 − 3 = 7 d.f.

There is probably no real need to pool, although many writers would advocate
pooling the first two and the last two classes. If this is done $\chi_s^2$ becomes 4.82
with only 5 d.f. The value of $P$ with either procedure turns out to be about
0.45. The conclusion is that the normal curve is a good fit to the data; the
hypothesis of a normal distribution is certainly not rejected by this test.

Here  again,  other  tests  of  normality  than  the  chi-square  test  are  possible. 
One  test  is  based  on  the  observed  sample  skewness  and  kurtosis,  both  of  which 
should,  of  course,  fluctuate  around  zero  (the  population  values).  For  this 
sample we find $g_1 = 0.114$ and $g_2 = 0.104$. For samples of size 1000 from a
normal population, the standard errors of $g_1$ and $g_2$, found from Eqs. (8.18.4)
and  (8.18.5),  are  0.077  and  0.154  respectively,  so  that  the  observed  values  differ 
from  zero  by  1.47  and  0.68  times  their  standard  errors.  The  probabilities  of 
discrepancies  numerically  as  great  as  these  are  0.14  and  0.50,  and  therefore,  even 
by  these  tests  the  assumption  of  normality  is  justified. 

10.6  The  Kolmogorov  Test  Like  the  chi-square  test,  this  test  is  one  of 
agreement  between  an  empirical  distribution  and  an  assumed  theoretical  one, 
but  it  is  based  on  the  cumulative  distribution  function  rather  than  on  the 
frequencies  in  the  separate  classes.  It  may  also  be  used  to  test  whether  two 
samples  may  reasonably  be  regarded  as  coming  from  the  same  population. 

For  the  one-sample  test,  suppose  that  H0  specifies  a  distribution  function 
F(x)  for  the  variate  x.  Also  suppose  that  SN(x)  is  the  observed  cumulative 
relative  frequency  in  a  sample  of  N  corresponding  to  any  given  x;  that  is,  if  the 
number of observations ≤ x is k, then

(10.6.1)  SN(x)  =  k/N 

We  should  expect  that  if  H0  is  true,  SN(x)  will  be  a  fairly  good  approximation 
to F(x), and will be better as N increases. In fact, according to the strong law of
large  numbers,  SN(x)  tends  to  F(x)  with  probability  1.  The  test  function  is 
DN,  the  least  upper  bound  (practically,  the  maximum)  of  the  absolute  deviation 
of  SN(x)  from  F(x) : 

(10.6.2)  DN  =  l.u.b.  \SN(x)  -  F(x)\ 

and the usefulness of the test depends on the fact that the distribution of $D_N$
does not depend on the form of F(x), as long as F(x) is continuous. We can
therefore take F(x) = x, 0 ≤ x ≤ 1.

It was proved by Kolmogorov, and the proof was simplified by Feller [6],
that for any given $\lambda > 0$,

(10.6.3)  $\lim_{N \to \infty} P(N^{1/2} D_N > \lambda) = L(\lambda)$

where

$L(\lambda) = 2 \sum_{n=1}^{\infty} (-1)^{n+1} e^{-2n^2\lambda^2}$

Thus for $\lambda = 1.36$, $L(\lambda) = 0.05$. The asymptotic probability that $D_N$ exceeds
$1.36\,N^{-1/2}$ is therefore 0.05.
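
The series in Eq. (10.6.3) converges very quickly for values of λ of practical interest, so the statement about λ = 1.36 is easy to verify numerically. In this sketch the function name `kolmogorov_tail` and the truncation at 100 terms are illustrative choices, not part of the text.

```python
import math

def kolmogorov_tail(lam, terms=100):
    """L(lambda) = 2 * sum_{n>=1} (-1)^(n+1) * exp(-2 n^2 lambda^2), truncated."""
    return 2.0 * sum(
        (-1) ** (n + 1) * math.exp(-2.0 * n * n * lam * lam)
        for n in range(1, terms + 1)
    )

print(f"L(1.36) = {kolmogorov_tail(1.36):.4f}")
print(f"L(1.63) = {kolmogorov_tail(1.63):.4f}")
```

The first value is close to 0.05 and the second close to 0.01, so 1.36 and 1.63 serve as the asymptotic multipliers of $N^{-1/2}$ at those two levels.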

Critical values for $D_N$, for small values of $N$, were calculated by Massey [7],
and  a  table  is  given  in  Appendix  B.6.  This  table  gives  the  values  which  are 
exceeded  with  the  given  probabilities  and  so  corresponds  to  the  upper  tail  of  the 
distribution  of  D.  For  values  of  N  larger  than  35  the  asymptotic  distribution  of 
Eq.  (10.6.3)  may  be  used.  This  is  given  in  the  last  line  of  the  table. 

Example  4  (Miller  [8])  Can  the  following  five  numbers  be  regarded  as  a 
random  choice  from  the  interval  0  to  1 :  0.52,  0.65,  0.13,  0.71,  0.58? 

Here  S5(x)  is  a  step  function  with  steps  of  equal  height  at  the  observed  values 
of  x  (see  Figure  45).  The  graph  of  F(x)  is  a  straight  line  from  the  origin  to  (1,  1). 


Fig. 45  Kolmogorov test


The maximum absolute deviation is 0.32, so that the probability is more than
0.20 that this value of $D_5$ would be exceeded if $H_0$ were true. There is therefore
no  reason  to  reject  H0. 
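
The value 0.32 in Example 4 can be obtained without drawing the graph. Since $S_N(x)$ jumps at each observation, the largest deviation from a continuous $F(x)$ must occur just before or just at one of the jumps, so it is enough to examine both sides of each step. The function name `kolmogorov_d` below is a hypothetical one used for illustration.

```python
def kolmogorov_d(sample, cdf):
    """Largest absolute deviation between the sample step function and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        F = cdf(x)
        # S_N equals (i-1)/n just below x and i/n at x; check both sides
        d = max(d, i / n - F, F - (i - 1) / n)
    return d

sample = [0.52, 0.65, 0.13, 0.71, 0.58]
d5 = kolmogorov_d(sample, lambda x: x)      # H0: uniform on (0, 1), F(x) = x

print(f"D_5 = {d5:.2f}")
```

The maximum is reached just below the second smallest observation, 0.52, where the step function is still at 0.2.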

For N = 5 there is a probability 0.05 that D > 0.565. We should therefore
expect that in about 95% of trials a random sample of 5 from a uniform
distribution on the interval (0, 1) will give a step function lying inside the band
bounded by F(x) = x ± 0.565.

Corresponding to an observed step function and a given $N$, we can form a
confidence belt, by drawing the curves for $S_N(x) \pm D_N$, where $D_N$ is the critical
value for the given $N$ and a given level of significance. Any theoretical F(x) which
lies wholly within the belt will not be rejected by the data, at this level.

Example  5  In  Table  10.4,  X  is  the  logarithm  of  soil  resistance  (in  ohms)  at 
a  certain  depth  and  k  is  the  cumulative  frequency  of  observations. 

Table  10.4

    X        k      $S_N(x)$

  1          0      0
  1.699      2      0.056
  2         11      0.306
  2.699     18      0.500
  3         19      0.528
  3.699     22      0.611
  4         23      0.639
  4.477     27      0.750
  5         30      0.833
  5.699     34      0.944
  6         36      1.000

For  N  =  36  we  may  take  DN  =  0.23  for  a  95  %  confidence  belt.  The  belt  is 
drawn  in  Figure  46.  Any  F(x)  which  lies  wholly  within  the  belt  could  be 
accepted  with  95  %  confidence. 
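
The belt of Example 5 can be tabulated as follows. This is a sketch only: the candidate distribution (uniform in X between 1 and 6) is a made-up illustration and is not proposed in the text, and the check is made only at the tabulated values of X, whereas a full check would follow the step function between them.

```python
import math

N = 36
x_values = [1, 1.699, 2, 2.699, 3, 3.699, 4, 4.477, 5, 5.699, 6]
cum_freq = [0, 2, 11, 18, 19, 22, 23, 27, 30, 34, 36]

d_crit = 0.23                          # value used in the text for N = 36
asymptotic = 1.36 / math.sqrt(N)       # large-sample 5% value, for comparison

# Belt S_N(x) +/- D_N, clipped to the range [0, 1] of a distribution function
belt = [
    (x, max(0.0, k / N - d_crit), min(1.0, k / N + d_crit))
    for x, k in zip(x_values, cum_freq)
]

def within_belt(cdf):
    """True if cdf lies inside the belt at every tabulated x."""
    return all(lo <= cdf(x) <= hi for x, lo, hi in belt)

def uniform_cdf(x):
    """Hypothetical candidate: X uniform between 1 and 6."""
    return (x - 1) / 5

print(f"asymptotic D_N = {asymptotic:.3f}")
print("uniform on (1, 6) inside belt:", within_belt(uniform_cdf))
```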
Fig. 46  Confidence belt for Kolmogorov test


*  10.7  The Power of the Kolmogorov Test  This test is correctly used only
when the hypothetical distribution is completely specified, as, for instance, in
testing whether a distribution is normal with given mean and variance. In
practice, when normality is tested, the mean and variance are generally estimated
from the sample itself. The effect of this on the $\chi^2$ test is merely to reduce the
degrees of freedom, but the effect on the Kolmogorov test is not precisely known.
It may be expected that the general effect will be to reduce the critical level of $D$,
so that the use of the tabulated values in such cases will be conservative.

In order to obtain the power of the Kolmogorov test, we need to have an
alternative hypothesis $H_1$ to the hypothesis $H_0$ under examination. If $H_0$
states that the population distribution function is $F_0(x)$ and $H_1$ states that it is
$F_1(x)$ and if the maximum absolute difference between these is $\delta$, then, as
Massey [7] has shown, the power of the test is at least

$1 - \Phi(2\delta N^{1/2} + 2D_N) + \Phi(2\delta N^{1/2} - 2D_N)$

where $D_N$ is the critical value corresponding to the level of significance α. The
actual power is likely to be considerably greater than this.

The power of the $\chi^2$ test in general is not known, but in some cases where
comparison with the Kolmogorov test is possible, it appears that the latter is
much the more powerful of the two. The least maximum absolute deviation of
the true distribution function $F_1(x)$ from an assumed distribution function
$F_0(x)$, which will lead to rejection of the latter with probability 0.50, has been
calculated for both tests, at the 5% and 1% significance levels, and is smaller for
the Kolmogorov test by a factor of nearly 2 (for N between 200 and 2000).

The Kolmogorov test is not applicable to discrete variates, whereas the $\chi^2$
test can be used for these. In other respects the former test seems to have
considerable advantages.

10.8 The Kolmogorov-Smirnov Test for Two Samples This test is concerned
with the agreement between two sets of observed values, and the null
hypothesis is that the two samples come from populations with the same
distribution function F(x). The test statistic is

(10.8.1)  $D_{mn} = \mathrm{l.u.b.}\,|S_m(x) - S_n(x)|$

where m and n are the sample sizes and $S_m(x)$ has the same meaning as before.

If it is desired to test the null hypothesis against the alternative hypothesis
that the first sample comes from a population in which F(x) is greater than it is
for the same x in the second population, we should use the actual difference
$S_m(x) - S_n(x)$ instead of its absolute value.

The asymptotic distribution of $D_{mn}$ was worked out by Smirnov, who
showed that, as long as F(x) is continuous,

(10.8.2)  $\lim_{m,n \to \infty} P(N^{1/2} D_{mn} > \lambda) = L(\lambda)$

where L(λ) is defined as in Eq. (10.6.3), N = mn/(m + n), and it is supposed that
m and n both tend to ∞ in such a way that the ratio m/n is finite.

A table of probabilities that D < k/n, for the case m = n, has been compiled
by Massey [9] and is the basis for the following table of critical values. The null
hypothesis is rejected at significance level α if, for samples of size n, the
maximum difference between the cumulative frequencies for any x is k or more.

Table 10.5

          α = 0.05                      α = 0.01
    n     k      n     k          n     k      n     k
    4     4     19     9          5     5     19    10
    5     5     20     9          6     6     20    11
    6     5     21     9          7     6     21    11
    7     6     22     9          8     7     22    11
    8     6     23    10          9     7     23    11
    9     6     24    10         10     8     24    12
   10     7     25    10         11     8     25    12
   11     7     26    10         12     8     26    12
   12     7     27    10         13     9     27    12
   13     7     28    11         14     9     28    13
   14     8     29    11         15     9     29    13
   15     8     30    11         16    10     30    13
   16     8     35    12         17    10     31    13
   17     8     40    13         18    10     32    13
   18     9

For large values of m and n, the values of λ in Eq. (2) which would justify
rejection of the null hypothesis at level of significance α are given in Table 10.6,
calculated by Smirnov [10].

Table 10.6

α     |     0.10    0.05    0.025    0.01     0.005    0.001
λ     |     1.22    1.36    1.48     1.63     1.73     1.95

Example 6 Suppose that in two samples of sizes 55 and 60 respectively
we find on drawing the cumulative step functions $S_m(x)$ and $S_n(x)$ that the
maximum absolute deviation is 0.25. Then $N^{1/2} D_{mn} = [(55)(60)/115]^{1/2}(0.25)$
= 1.34. The probability of a value as great as this is a little more than 0.05, so
that the difference is not quite large enough to reject the null hypothesis at the
0.05 level.
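
The arithmetic of Example 6 can be sketched in a few lines of Python. This is only an illustration: the function names `ks_two_sample` and `scaled_statistic` are hypothetical, and the comparison value 1.36 is the α = 0.05 entry of Table 10.6.

```python
import math

def ks_two_sample(sample_a, sample_b):
    """Largest absolute gap between the two empirical step functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    m, n = len(a), len(b)
    d = 0.0
    # Both step functions change only at observed values, so it is
    # enough to evaluate the gap at every observed point.
    for x in sorted(set(a) | set(b)):
        s_m = sum(v <= x for v in a) / m
        s_n = sum(v <= x for v in b) / n
        d = max(d, abs(s_m - s_n))
    return d

def scaled_statistic(d, m, n):
    """N^(1/2) * D with N = mn/(m + n), for comparison with Table 10.6."""
    return math.sqrt(m * n / (m + n)) * d

# Example 6: samples of sizes 55 and 60 with maximum deviation 0.25.
stat = scaled_statistic(0.25, 55, 60)
print(round(stat, 2))   # 1.34
print(stat > 1.36)      # False: not significant at the 0.05 level
```

When the raw observations are available, `ks_two_sample` supplies the deviation that is then passed to `scaled_statistic`.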


10.9 The Sign Test for Paired Samples The ordinary t-test for the significance
of an observed effect in paired samples (§8.9) assumes that all the paired
differences can be regarded as independently and normally distributed with a
common variance. Sometimes the pairs are observed under widely different
conditions and the assumptions of normality and of common variance seem
unwarranted. If so, the sign test may be used. This test is very simple to apply
and merely assumes that the median of the population of differences is μ, so that
the probability that an observed $d_i > \mu$ is the same as the probability that
$d_i < \mu$.

The null hypothesis is that μ = 0 and the alternative hypothesis (for a one-
tailed test) is that μ > 0, or that μ < 0. For a two-tailed test the alternative is
that |μ| > 0.

On the null hypothesis, the expected numbers of positive signs and of negative
signs among the differences in a sample of N pairs will be N/2. The sampling
distribution of the number of positive (or negative) signs will be binomial with
θ = ½. A table of cumulative binomial probabilities for θ = ½ is included in the
Appendix, Table B.7. This gives, for N between 5 and 25 inclusive, the
probabilities of occurrence of r or fewer successes, where r < N/2. For the two-tailed
test, the probabilities should be doubled.

In practice, we let r be the number of less frequent signs among the differences
$d_i$ (= $x_{1i} - x_{2i}$), so that the condition r < N/2 is satisfied. If any observed
differences happen to be exactly zero, they are not counted, and the sample
size is correspondingly reduced.

Example  7  The  data  in  Table  10.7  represent  yields  in  bushels  for  two 
varieties  of  apples,  A  and  B,  each  pair  of  trees  being  planted  near  together  under 
similar  conditions  of  soil,  moisture,  etc.  The  separate  pairs  are,  however, 
scattered  over  various  localities. 

Table 10.7

   x1 (A)    x2 (B)    x1 - x2    x1 - (x2 + 1)
     13        16        -3           -4
     12        11         1            0
     10         8         2            1
      6         6         0           -1
     13        12         1            0
     15        15         0           -1
     19        14         5            4
     10         9         1            0
     11         8         3            2
     11        11         0           -1
     13        13         0           -1
      9        10        -1           -2
     14        12         2            1
     12        11         1            0
     12         9         3            2

In the column of differences, $x_1 - x_2$, there are two minus signs in 11 nonzero
items. The probability of two or fewer minus signs is 0.0327, so that the
difference is significant, if we assume that A's median yield is certainly not lower
than B's. If the difference could be either way, the probability is doubled, and
the observed difference would not be significant at the 5% level.

When, as in this case, we are dealing with a type of measurement which has a
well-defined unit and a zero, we can employ the sign test to decide whether or
not the true difference reaches a certain value. Thus, in Table 10.7, if we add
one bushel to each of the x2 values and re-compute the differences, we get
Column 4, which has six minus signs and five plus signs. The value of r is now
5 and the corresponding probability is 0.500. The difference between the yields
of A and B is therefore not as great as one bushel, in favour of A.
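
The two tail probabilities quoted in Example 7 can be reproduced directly from the binomial distribution with θ = ½. The sketch below is illustrative only; `sign_test_tail` is a hypothetical helper name, and the data are the yields of Table 10.7.

```python
from math import comb

def sign_test_tail(differences):
    """Return (r, N, P) where r is the count of the less frequent sign,
    N the number of nonzero differences, and P the one-tailed
    probability of r or fewer such signs when theta = 1/2."""
    nonzero = [d for d in differences if d != 0]   # zeros are not counted
    n = len(nonzero)
    plus = sum(d > 0 for d in nonzero)
    r = min(plus, n - plus)
    p = sum(comb(n, i) for i in range(r + 1)) / 2 ** n
    return r, n, p

x1 = [13, 12, 10, 6, 13, 15, 19, 10, 11, 11, 13, 9, 14, 12, 12]   # variety A
x2 = [16, 11, 8, 6, 12, 15, 14, 9, 8, 11, 13, 10, 12, 11, 9]      # variety B

diffs = [a - b for a, b in zip(x1, x2)]
r, n, p = sign_test_tail(diffs)
print(r, n, round(p, 4))        # 2 11 0.0327

# Add one bushel to each B yield and repeat (Column 4 of Table 10.7).
r, n, p = sign_test_tail([d - 1 for d in diffs])
print(r, n, round(p, 3))        # 5 11 0.5
```

Doubling the returned probability gives the two-tailed value.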

For values of N larger than 25, the normal approximation to the binomial
may be used. The probability of r or fewer successes is approximately Φ(z),
where

(10.9.1)  $z = \dfrac{2r + 1 - N}{N^{1/2}}$

The power-efficiency of the sign test is about 95% for N = 6, but diminishes
as N increases to an asymptotic value of 2/π ≈ 63%. This means that the sign
test has about the same power for a sample of size 100, say, as the most powerful
test against the same alternative for a sample of size 63. The most powerful
test would in fact be the Student t-test, provided that the assumptions for
this test are met. The sign test has the advantage that it can be used in
circumstances where the t-test is not applicable. For samples of size 13 the efficiency
is still about 75%, so that there is comparatively little loss of power in using
the simpler sign test for samples of moderate size, even though a t-test could
legitimately be used.

10.10 The Wilcoxon Signed-Rank Test This is another test used on matched
pairs [16], more powerful than the sign test because it gives more weight to large
numerical differences between the members of a pair than to small differences.
The 2N subjects are divided into N pairs, each pair as evenly matched as possible.
If the effect of a difference of treatments is to be investigated, the choice as to
which member of any pair has treatment A and which has treatment B is made
at random. The assumptions are that the differences are independent
continuous variates from symmetrical populations with common mean μ. The
null hypothesis is that μ = 0 and the alternative hypothesis is one of: μ > 0,
μ < 0, or |μ| > 0.

The observed differences $d_i = x_{1i} - x_{2i}$ (where $x_1$ refers to A and $x_2$ to B)
are ranked in increasing order of absolute magnitude and the sum of the ranks is
computed for all the differences of like sign. The test statistic T is the smaller
of these two rank-sums (one for positive $d_i$ and one for negative $d_i$). Pairs with
$d_i = 0$ are not counted.

Since the variate is supposed continuous, ties should occur only rarely, but
will sometimes happen because of the limited accuracy of measurement. If two or
more of the $d_i$ have the same magnitude they are given a rank which is the average
of the ranks they would have had if they had differed slightly. Thus if the three
numerically lowest values of $d_i$ happened to be -1, -1 and 1, they would all be
given rank 2, which is the mean of ranks 1, 2 and 3. (The signs are disregarded.)

On the null hypothesis, the expected values of the two rank-sums would be
equal. If the negative rank-sum is the smaller, and is equal to or less than the
value given for the appropriate N in Table 10.8, the null hypothesis will be
rejected at the corresponding level of significance α, in favour of the alternative
hypothesis that μ > 0. If the positive rank-sum is the smaller, the alternative
will be that μ < 0. If a two-tailed test is required, the alternative being that
|μ| > 0, the given levels of significance should be doubled.

Table 10.8ᵃ

    N     α = .025    α = .01    α = .005
    6        0           -          -
    7        2           0          -
    8        4           2          0
    9        6           3          2
   10        8           5          3
   11       11           7          5
   12       14          10          7
   13       17          13         10
   14       21          16         13
   15       25          20         16
   16       30          24         20
   17       35          28         23
   18       40          33         28
   19       46          38         32
   20       52          43         38
   21       59          49         43
   22       66          56         49
   23       73          62         55
   24       81          69         61
   25       89          77         68

ᵃ Adapted from Table I of reference [16] with the kind permission of the author,
F. Wilcoxon, and the publishers, American Cyanamid Co.


For larger values of N, T is approximately normally distributed with mean and
variance given by

(10.10.1)  $\mu = \dfrac{N(N + 1)}{4}, \qquad \sigma^2 = \dfrac{N(N + 1)(2N + 1)}{24}$

This means that $[\,|T - N(N + 1)/4| - \tfrac{1}{2}\,]/\sigma$ is approximately a standard normal
variate. Thus at the 0.025 significance level, for which z = 1.96, with N = 25,
we find $T \approx 162 - 1.96(1381)^{1/2} = 89.1$, in agreement with the last line of
Table 10.8.

Example 8 For the data of Table 10.7, the ranks of $x_1 - x_2$ (excluding
the zero values) are as shown below:

d:       -3     1     2     1     5     1     3    -1     2     1     3
rank:     9     3     6½    3    11     3     9     3     6½    3     9

The sum of ranks for the two negative $d_i$ is 12 and for the nine positive $d_i$
is 54. Therefore T = 12 and N = 12. The hypothesis that the expectation of $d_i$
is positive is therefore acceptable at the 2.5% level. The hypothesis that the
expectation is not zero is acceptable at the 5% level. The latter decision (based
on the two-tailed test) is the one that would normally be taken unless there is
good reason to believe, before the data are obtained, that if there is any difference
it can only be in one direction.
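
The ranking rule with averaged ties, and the normal approximation used for larger N, can be sketched as follows. The helper names `signed_rank_sums` and `normal_critical_value` are hypothetical; the second applies the continuity correction of ½ in the same way as the worked figure for N = 25.

```python
import math

def signed_rank_sums(differences):
    """Return (positive rank-sum, negative rank-sum), giving tied
    magnitudes the mean of the ranks they jointly occupy."""
    nonzero = [d for d in differences if d != 0]    # zeros are dropped
    ordered = sorted(nonzero, key=abs)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        rank_of[abs(ordered[i])] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    pos = sum(rank_of[abs(d)] for d in nonzero if d > 0)
    neg = sum(rank_of[abs(d)] for d in nonzero if d < 0)
    return pos, neg

def normal_critical_value(n, z):
    """Approximate lower critical value of T for large N."""
    mean = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return mean - 0.5 - z * sigma

# Differences x1 - x2 from Table 10.7.
diffs = [-3, 1, 2, 0, 1, 0, 5, 1, 3, 0, 0, -1, 2, 1, 3]
pos, neg = signed_rank_sums(diffs)
print(pos, neg, min(pos, neg))                       # 54.0 12.0 12.0

# N = 25 at the 0.025 level (z = 1.96).
print(math.floor(normal_critical_value(25, 1.96)))   # 89
```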

The power-efficiency of the Wilcoxon test is remarkably high. Asymptotically,
it is 3/π = 95.5%, as compared with the t-test, in circumstances where
both tests would be applicable.

* 10.11 The Walsh Test This is a test with assumptions similar to those for
the Wilcoxon signed-rank test, but depending on the averages of pairs of
differences. The differences $d_i$ are arranged in increasing order, taking account
of sign. The null hypothesis is that the median μ of all these differences is zero.
The alternative (two-tailed) hypothesis is that this median is not zero.

The test statistics used are various combinations of the differences. Thus,
for N = 5, with a two-tailed test, we should reject H0 at the level α = 0.125 if
either $\tfrac{1}{2}(d_4 + d_5) < 0$ or $\tfrac{1}{2}(d_1 + d_2) > 0$. We should reject at the level α = 0.062
if either $d_5 < 0$ or $d_1 > 0$. If we felt in advance that μ was bound to be negative
if not zero, we could reject H0 at the level 0.062 if $\tfrac{1}{2}(d_4 + d_5) < 0$ or at the level
0.031 if $d_5 < 0$. Table B.8 in the Appendix, from Walsh [11], gives for values
of N from 5 to 15 the various tests which may be applied at the significance levels
indicated, for both one-tailed and two-tailed tests.

Some of these tests are equivalent to the Wilcoxon signed-rank test, but
others are not. The efficiency of the tests is, for the most part, from 95 to 99%,
and is nowhere below 87.5% (for the first test when N = 10, one-tailed).

Example 9 Can the following seven observations, arranged in order of
size, be considered as coming from populations with the common median 0.5?

-2.5,     -1.5,     -1.3,     -0.1,     0.4,     0.7,    0.8

If we subtract 0.5 from each of these values, the null hypothesis will be that
μ = 0. The values are

   d1      d2      d3      d4      d5      d6      d7
  -3.0    -2.0    -1.8    -0.6    -0.1     0.2     0.3

For N = 7, there are eight tests altogether, summarized below:

Sig. level (α)    Tests

0.109    $\max[d_5, \tfrac{1}{2}(d_4 + d_7)] = -0.1$,    $\min[d_3, \tfrac{1}{2}(d_1 + d_4)] = -1.8$

0.047    $\max[d_6, \tfrac{1}{2}(d_5 + d_7)] = 0.2$,     $\min[d_2, \tfrac{1}{2}(d_1 + d_3)] = -2.4$

0.031    $\tfrac{1}{2}(d_6 + d_7) = 0.25$,               $\tfrac{1}{2}(d_1 + d_2) = -2.5$

0.016    $d_7 = 0.3$,                                    $d_1 = -3.0$

Only the first test, at level α = 0.109, leads to rejection of H0, since the value
-0.1 is negative. At the 0.047 level we should accept H0.

Using the Wilcoxon test on the same data, we have T = 5, N = 7, which
would lead us to accept H0 at any level up to 0.05. This does not contradict the
Walsh test, but the latter is more informative.
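
The four two-tailed Walsh tests for N = 7 used in Example 9 can be written out as a short sketch. The significance levels and the pairings of differences are those listed in the example; the function names are hypothetical, and the rejection rule assumed is the one applied in the text (reject H0 when the upper statistic is negative or the lower statistic is positive).

```python
def walsh_tests_n7(differences):
    """Return {level: (upper statistic, lower statistic)} for N = 7."""
    d1, d2, d3, d4, d5, d6, d7 = sorted(differences)
    return {
        0.109: (max(d5, (d4 + d7) / 2), min(d3, (d1 + d4) / 2)),
        0.047: (max(d6, (d5 + d7) / 2), min(d2, (d1 + d3) / 2)),
        0.031: ((d6 + d7) / 2, (d1 + d2) / 2),
        0.016: (d7, d1),
    }

def rejects(upper, lower):
    """H0 is rejected if the upper statistic is negative or the lower positive."""
    return upper < 0 or lower > 0

observations = [-2.5, -1.5, -1.3, -0.1, 0.4, 0.7, 0.8]
shifted = [x - 0.5 for x in observations]     # test the median 0.5

decisions = {level: rejects(*stats)
             for level, stats in walsh_tests_n7(shifted).items()}
for level, decision in decisions.items():
    print(level, "reject" if decision else "accept")
```

Only the 0.109 level yields a rejection, in line with the conclusion of the example.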

10.12 The Mann-Whitney U-test This is a test of the null hypothesis that
two independent samples A and B come from populations α and β with the same
distribution. The alternative (one-tailed) hypothesis is that the variate values in
population α are stochastically larger than those in β (or, of course, smaller).
This means that if a is any item from α and b any item from β, the probability
that a > b (or in the other case that a < b) is greater than 0.5. If this is so, the
"bulk" of population α has larger (or smaller) variate values than the bulk of
population β. As before, a two-tailed test may be used, the alternative hypothesis
being that P(a > b) is not 0.5.

Let us choose for sample A the one with the smaller size, if the sizes differ.
If N1 and N2 are the sample sizes and N = N1 + N2, we rank the combined
samples in increasing order and then find the sum of ranks, R1 and R2, for the
two samples separately. The sum R1 + R2 must be equal to N(N + 1)/2, and
this fact serves as a check.

The test statistic is given by the smaller of the two quantities:

(10.12.1)  $\begin{cases} U = N_1 N_2 + \dfrac{N_1(N_1 + 1)}{2} - R_1 \\[4pt] U' = N_1 N_2 - U \end{cases}$

An equivalent test using R1 was first proposed by Wilcoxon, but Mann and
Whitney gave a more complete treatment, with more extensive tables.

This quantity U is equal to the number of times that an item in A precedes
in the ranking an item in B. U' is the number of times that an item in B precedes
an item in A. If P(a > b) is large, most items in B will have lower ranks than
most items in A and U will be small. The smallness of U determines whether the
null hypothesis should be rejected.

Example 10 The values of x for two samples are as shown:

   Sample A (N1 = 8)         Sample B (N2 = 9)
     x1       Rank             x2       Rank
    15.2        7             11.5        3
     8.6        1             12.6        5
     9.3        2             19.4       13
    14.4        6             21.3       14
    15.6        8             32.5       17
    11.8        4             18.6       12
    16.3        9             17.0       10
    17.8       11             23.4       15
                              29.6       16
    R1 = 48                   R2 = 105

Here N = 17, ½N(N + 1) = 153 = R1 + R2.
From Eq. (1), U = 60, U' = 72 - 60 = 12.

The first item in B (with rank 3) precedes six items in A, the second item precedes
five items in A, and the seventh item (with rank 10) precedes one item in A.
None of the other items in B precedes anything in A. The total of precedences
is 12, agreeing with the calculation of U'.

For values of N1 and N2 which are moderately large (say 9 or more) the
sampling distribution of U (or U') is approximately normal, with mean and
variance given by

(10.12.2)  $\mu = \dfrac{N_1 N_2}{2}, \qquad \sigma^2 = \dfrac{N_1 N_2 (N + 1)}{12}$

This implies that the variate

(10.12.3)  $z = \dfrac{|U - \mu| - \tfrac{1}{2}}{\sigma} = \left[\dfrac{N_1 N_2 (N + 1)}{3}\right]^{-1/2} \left[\,|N_1(N + 1) - 2R_1| - 1\,\right]$

is approximately a standard normal variate.

For small values of N1 and N2, special tables due to Mann and Whitney [12]
must be used. These give the probabilities of a value of U less than or equal to
that observed. For N2 ≥ 9, Table B.9 in the Appendix may be used. This gives,
for selected significance levels and for selected values of N1 and N2, critical
values of U. Observed values of U (or U') less than or equal to the tabular value
are cause for rejection of H0 at the level quoted. The level should be doubled for
a two-tailed test.

In the example above, N1 = 8, N2 = 9, U' = 12. The critical values for
α = .05 and .01 are 18 and 11 respectively. The observed U' is therefore
significant at the 0.05 level, and almost significant at the 0.01 level, for a one-tailed test.

The normal approximation gives μ = 36, σ² = 108, z = -2.26,
corresponding to a probability of 0.01. The sample B apparently comes from a
population β with on the whole significantly higher x-values than population α,
at about the 1% level of significance.

If, before obtaining the samples, we admitted the possibility that either
population might have on the whole higher values than the other, we should
use a two-tailed test and say that α and β differ at the 2% level of significance.
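
The quantities in Example 10 can be recomputed from the raw values. The sketch below assumes no tied observations (true of these data); `mann_whitney` is a hypothetical helper name, and z is returned as a magnitude.

```python
import math

def mann_whitney(sample_a, sample_b):
    """Return (R1, U, U', |z|) using the rank-sum form of U."""
    n1, n2 = len(sample_a), len(sample_b)
    combined = sorted(sample_a + sample_b)
    rank = {value: i + 1 for i, value in enumerate(combined)}  # no ties assumed
    r1 = sum(rank[value] for value in sample_a)
    u = n1 * n2 + n1 * (n1 + 1) // 2 - r1
    u_prime = n1 * n2 - u
    n = n1 + n2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n + 1) / 12)
    z = (abs(u - mu) - 0.5) / sigma          # continuity-corrected
    return r1, u, u_prime, z

a = [15.2, 8.6, 9.3, 14.4, 15.6, 11.8, 16.3, 17.8]
b = [11.5, 12.6, 19.4, 21.3, 32.5, 18.6, 17.0, 23.4, 29.6]

r1, u, u_prime, z = mann_whitney(a, b)
print(r1, u, u_prime, round(z, 2))           # 48 60 12 2.26

# Direct count of the times an item in B precedes an item in A.
precedences = sum(y < x for x in a for y in b)
print(precedences)                           # 12
```

The direct count agrees with U', which is a useful check on the ranking.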

When ties occur they are treated as in the Wilcoxon test. The effect of ties
is to reduce somewhat the value of σ² in the normal approximation. If there are
t observations tied for a particular rank and if T = (t³ - t)/12, the corrected
σ² is given by

(10.12.4)  $\sigma^2 = \dfrac{N_1 N_2}{N(N - 1)} \left[ \dfrac{N(N^2 - 1)}{12} - \sum T \right]$

the sum being taken over all groups of tied observations. In most cases the
correction makes little practical difference.

If applied to data which could be analyzed by the parametric t-test, the Mann-
Whitney test has a high power-efficiency, close to 95% and asymptotically
equal to 3/π. For some distributions it is even superior to the t-test in its power
to reject H0.

10.13  Tests  of  Randomness  It  is  sometimes  desirable  to  test  whether  a 
series  of  observations  can  be  regarded  as  random.  The  residuals  in  a  time  series, 
after  removing  a  trend,  may  be  so  tested,  for  example.  According  to  von  Mises, 
the criterion as to whether a series is random or not is that the relative
frequencies in any sub-series of the given series shall be the same as in the original
series,  providing  that  the  series  is  very  long  and  that  the  sub-series  is  picked  out 
by  some  pre-assigned  rule.  The  sub-series  could  consist,  for  instance,  of  every 
third  term,  or  every  term  corresponding  to  a  prime  number,  but  the  rule  would 
obviously  have  to  be  independent  of  the  nature  of  the  terms  picked.  In  a  series 
consisting  of  zeros  and  ones,  the  rule  could  not  be  to  pick  all  the  zeros. 

This  criterion  is  clearly  not  a  practical  one  for  testing  the  randomness  of  a 
given  finite  series.  Various  tests  are  actually  used  in  such  a  case.  One  is  to 
determine  the  relative  frequencies  of  different  kinds  of  terms.  In  the  series  of 
digits of the decimal expansion of π (3.14159 . . .), now known to 10,000 places,
one can count the numbers of 0's, 1's, 2's, etc. If the distribution were random,
one  would  expect  equal  numbers  of  each  digit.  The  actual  distribution  of  10,001 
digits  is  as  shown  in  the  following  table : 


Table 10.9

Digit      |     0      1      2      3      4      5      6      7      8      9
Frequency  |   961   1008   1000   1001   1011   1031   1026   1000    953   1010

The χ² test for agreement between the observed and calculated values gives
$\chi_s^2$ = 6.21 for nine d.f., so that P = 0.7. The hypothesis of a random
distribution of the digits is certainly not to be rejected, as far as this test goes.

The  frequency  of  pairs  of  digits  can  also  be  compared  with  expectation,  or, 
in  sets  of  four  consecutive  digits,  the  frequencies  of  four  or  three  of  a  kind,  two 
pairs,  etc.  (the  so-called  "poker"  test),  can  be  used.  The  "gap"  test  uses  the 
average  separation  between  zeros.  All  these  tests  are  normally  made  on  sets  of 
random  numbers  produced  by  some  mechanical  process,  and  intended  to  be  used 
for  randomizing  in  experimental  work.  It  will  of  course  occasionally  happen  that 
some  sub-group  of  random  numbers  will  fail  to  pass  a  particular  test.  This  group 
should  not  be  used  by  itself,  but  may  be  quite  satisfactory  as  part  of  a  larger 
group. 

* 10.14 Runs Up and Down Several tests of randomness are based on the
occurrence of runs in a series. Suppose the observed series consists of numbers
$x_1, x_2, x_3, \ldots, x_N$, and consider the signs of the differences $d_1 = x_2 - x_1$,
$d_2 = x_3 - x_2$, ..., $d_{N-1} = x_N - x_{N-1}$. The sequence of N - 1 signs may look
something like this:

S:  + + - + - - - - + - + + -

The plus signs correspond to runs up in the original series (increasing values of
x), the minus signs correspond to runs down (decreasing values). In the
illustration there are runs up of length 2, 1, 1, and 2, and runs down of length 1, 4, 1
and 1. In this notation three consecutive increasing x terms give a run up of
length 2.
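
Counting runs from a sign sequence is a simple grouping operation. The sketch below is illustrative: the helper names are hypothetical, and the sign string is one sequence having the run lengths described in the illustration (runs up of 2, 1, 1, 2 and runs down of 1, 4, 1, 1).

```python
from itertools import groupby

def difference_signs(series):
    """Signs of successive differences; assumes no two adjacent values are equal."""
    return "".join("+" if b > a else "-" for a, b in zip(series, series[1:]))

def run_lengths(signs):
    """Return (lengths of runs up, lengths of runs down) in order of occurrence."""
    up, down = [], []
    for sign, group in groupby(signs):
        (up if sign == "+" else down).append(len(list(group)))
    return up, down

signs = "++-+----+-++-"
up, down = run_lengths(signs)
print(up, down)               # [2, 1, 1, 2] [1, 4, 1, 1]
print(len(up) + len(down))    # 8 runs in all
print(signs.count("+"))       # 6 positive differences

# Three consecutive increasing terms give a run up of length 2.
print(run_lengths(difference_signs([1, 2, 3])))   # ([2], [])
```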

Tests of randomness have been suggested based on the total number of runs
(r), whether up or down, on the number of plus signs (k) in the sequence S, and
on various other properties of the sequence. Moore and Wallis [13] have
described such tests, and Levene [14] has discussed their power function.

We may assume that the original observations are all distinct. If two
consecutive items happen to be equal, we must suppose that with more accurate
measurement they would differ and the difference would be equally likely to be
plus or minus. The run lengths are reckoned on both suppositions, and the
corresponding probabilities calculated.

The total number of permutations of the N numbers $x_1, x_2, \ldots, x_N$ is N!.
The number producing exactly k positive differences is

(10.14.1)  $\phi_N(k) = \sum_{i=0}^{k} (-1)^i \binom{N + 1}{i} (k + 1 - i)^N$

The probability of exactly k positive differences is therefore $\phi_N(k)/N!$. The same
expression holds for the probability of k negative differences. The expectation of
k is given by

(10.14.2)  $E(k) = \dfrac{N - 1}{2}$

as is obvious from symmetry. The variance of k is

(10.14.3)  $V(k) = \dfrac{N + 1}{12}$

The kurtosis of the distribution turns out to be $-\dfrac{6}{5(N + 1)}$ and so tends to
zero as N → ∞. For N > 12, the distribution of k is approximately normal. In
using the normal approximation, the correction for continuity should be applied
by diminishing the observed value of $\left|k - \dfrac{N - 1}{2}\right|$ by ½. This is equivalent to
taking as a standard normal variate the quantity

(10.14.4)  $z = \pm \left(\dfrac{3}{N + 1}\right)^{1/2} \left(|2k - N + 1| - 1\right)$
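
The counting formula (10.14.1) can be checked by brute force for small N. The sketch below assumes the formula is read as the alternating sum of binomial coefficients C(N + 1, i) times (k + 1 - i)^N over i = 0, ..., k; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import comb, factorial

def phi(n, k):
    """Number of permutations of n distinct values with exactly k positive differences."""
    return sum((-1) ** i * comb(n + 1, i) * (k + 1 - i) ** n for i in range(k + 1))

def brute_force(n, k):
    """Same count obtained by enumerating all n! permutations."""
    return sum(
        1
        for p in permutations(range(n))
        if sum(b > a for a, b in zip(p, p[1:])) == k
    )

def moments(n):
    """Exact mean and variance of k under randomness."""
    probs = [Fraction(phi(n, k), factorial(n)) for k in range(n)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

print([phi(4, k) for k in range(4)])                       # [1, 11, 11, 1]
print(all(phi(5, k) == brute_force(5, k) for k in range(5)))   # True
print(moments(6))   # (Fraction(5, 2), Fraction(7, 12)), i.e. (N - 1)/2 and (N + 1)/12
```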

Example  11  In  a  time-series  of  sweet  potato  production  in  the  United  States 
over  the  years  1868-1937,  it  was  found  that  the  69  differences  were  positive  in 
37  cases  and  negative  in  32.  With  N  =  70,  we  find  z  =  +0.822.  The  probability 
of a value greater numerically than this is 0.41, so that the hypothesis of randomness
can be accepted. There is no evidence, as far as this test goes, of a trend in
the  series.  If  there  is  one  it  is  swamped  by  the  extent  of  the  random  variations. 
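
The figures of Example 11 follow from the continuity-corrected normal variate for k. In this sketch the function names are hypothetical, z is returned as a magnitude, and the two-sided probability is obtained from the complementary error function.

```python
import math

def z_positive_differences(n, k):
    """|z| for k positive differences in a series of n observations."""
    return math.sqrt(3 / (n + 1)) * (abs(2 * k - n + 1) - 1)

def two_sided_probability(z):
    """P(|Z| > z) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))

# Example 11: N = 70 observations, 37 positive and 32 negative differences.
z = z_positive_differences(70, 37)
print(round(z, 3))                          # 0.822
print(round(two_sided_probability(z), 2))   # 0.41
```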

A test may also be based on the total number of runs up and down. If r is
this number, and if we include the runs at the beginning and end, we find

(10.14.5)  $E(r) = \dfrac{2N - 1}{3}, \qquad V(r) = \dfrac{16N - 29}{90}$

Asymptotically, r is normally distributed.

The expected frequencies for runs of given length may also be calculated.
If $E(r_p)$ is the expected number of runs of length p exactly, and if $E(r'_p)$ is the
expected number of lengths p or more,

(10.14.6)  $E(r_1) = \dfrac{5N + 1}{12}, \qquad E(r_2) = \dfrac{11N - 14}{60}, \qquad E(r'_3) = \dfrac{4N - 11}{60}$

The χ² test may be used for comparing the observed and expected values, but
the distribution of $\chi_s^2$ is not quite that of Pearson's χ². For N > 12 and the
three classes suggested above (p = 1, 2 and 3 or more), the observed $\chi_s^2$ may
be multiplied by 6/7 and referred to the χ² distribution for two degrees of
freedom, at least as an approximation.

Example 12 In a certain time series of 36 observations, the number of
positive differences was 18 and the number of negative ones 17. The total number
of runs up and down (r) was 25; and the observed values for $r_1$, $r_2$ and $r'_3$ were
16, 8, and 1 respectively.

Here E(r) = 23.67, V(r) = 6.08, σ(r) = 2.47. The observed r is clearly not
significantly different from E(r), on the basis of the normal approximation.

The expectations of $r_1$, $r_2$ and $r'_3$ are 15.08, 6.37 and 2.22 respectively. The
value of $\chi_s^2$ is 1.143, and 6/7 of this is 0.98. For 2 d.f., this is not significant. The
tests agree in permitting us to accept the hypothesis that the given series is
random.
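
The numbers in Example 12 can be reproduced from the expectation formulas for runs. This is a sketch with hypothetical function names; exact fractions are used so that only the final figures are rounded.

```python
from fractions import Fraction

def run_moments(n):
    """E(r) and V(r) for the total number of runs up and down."""
    return Fraction(2 * n - 1, 3), Fraction(16 * n - 29, 90)

def expected_run_lengths(n):
    """Expected numbers of runs of length 1, of length 2, and of length 3 or more."""
    return [Fraction(5 * n + 1, 12), Fraction(11 * n - 14, 60), Fraction(4 * n - 11, 60)]

n = 36
observed = [16, 8, 1]

mean_r, var_r = run_moments(n)
expected = expected_run_lengths(n)
chi_s = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(float(mean_r), 2), round(float(var_r), 2))   # 23.67 6.08
print([round(float(e), 2) for e in expected])            # [15.08, 6.37, 2.22]
print(round(float(chi_s), 3))                            # 1.143
print(round(float(chi_s) * 6 / 7, 2))                    # 0.98
```

The three expected values sum to E(r), which serves as a check on the formulas.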

Levene [14] has considered the power of tests based on k and r. The null
hypothesis H0 is that the observations are random. The alternative hypothesis
H1 may be that there is a linear trend in the observations, or perhaps that there
is a cyclical trend. For detecting a linear trend, the test with k is much more
powerful than that with r, but it appears to be less powerful for certain cyclical
trends. Since in the limit, k and r have a joint normal distribution, and are
uncorrelated under H0, the statistic

(10.14.7)  $\dfrac{[k - E(k)]^2}{V(k)} + \dfrac{[r - E(r)]^2}{V(r)} = \chi_s^2$

is approximately a χ² variate with two degrees of freedom.

Example 13 An artificial upward linear trend was added to the time series
of Example 12. This increased the number of positive differences to 21, and
reduced the number of negative ones to 14. The total number of runs (r) was
reduced to 21. With the continuity correction, k - E(k) = 3, V(k) = 37/12,
r - E(r) = -13/6, V(r) = 547/90, so that the expression in Eq. (7) is 108/37
+ 845/1094 = 2.92 + 0.77 = 3.69. For 2 d.f., P = 0.17, which indicates that
the added trend is not significant even at the 0.1 level. It may be seen that the
main contribution to $\chi_s^2$ arises from k rather than from r, thus verifying that the
former is more sensitive than the latter to linear trends.
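
The two contributions in Example 13 can be verified with exact fractions. The function name below is hypothetical, and the continuity correction of ½ is applied to both deviations, as in the example.

```python
from fractions import Fraction

def trend_statistic_terms(n, k, r):
    """Return the k-term and the r-term of the combined statistic,
    each computed from a deviation reduced in magnitude by 1/2."""
    half = Fraction(1, 2)
    dev_k = abs(k - Fraction(n - 1, 2)) - half
    dev_r = abs(r - Fraction(2 * n - 1, 3)) - half
    term_k = dev_k ** 2 / Fraction(n + 1, 12)
    term_r = dev_r ** 2 / Fraction(16 * n - 29, 90)
    return term_k, term_r

term_k, term_r = trend_statistic_terms(36, 21, 21)
print(term_k, term_r)                       # 108/37 845/1094
print(round(float(term_k + term_r), 2))     # 3.69
```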

*  10.15  Jonckheere's  Test  This  is  designed  to  test  the  null  hypothesis  that 
several  samples  are  randomly  drawn  from  the  same  population,  when  the 
alternative  hypothesis  specifies  a  certain  rank  ordering  of  the  populations.  The 
ordinary  F-test  may  be  used  to  test  this  null  hypothesis,  but  the  alternative  does 
not  involve  any  particular  rank  ordering  of  the  populations. 

If  there  are  only  two  samples  we  can,  of  course,  use  the  Mann-Whitney  test, 
or,  if  appropriate,  a  one-tailed  /-test,  the  alternative  hypothesis  being,  say,  that 
Mi  >  A*2-  Jonckheere's  test  may  be  used  with  k  samples.  One  application  is  to 
time-series,  when  on  each  occasion  the  observed  event  may  be  one  of  k  possi- 
bilities. The  null  hypothesis  would  be  that  each  of  these  occurs  at  random,  and 
the  alternative  hypothesis  that  the  different  types  of  event  tend  to  occur  in  a 
certain  definite  order. 
For convenience we will suppose that the k samples are all of size r (although
it is not necessary to do so). If the ith sample comes from a population with
distribution function $F_i(x)$, the null hypothesis is that the $F_i(x)$ are all the same
function of x, against the alternative that there is an ordering of the populations
such that

(10.15.1)  $F_1(x) < F_2(x) < \cdots < F_k(x)$

for all x. This condition will be satisfied if there is a real additive treatment effect,
different for each sample. We may express the alternative hypothesis as
$F_i(x) < F_j(x)$ for all x, where $i = 1, 2, \ldots, k - 1$ and $j = i + 1, \ldots, k$.

Let $x_{im}$ be the mth item in the ith sample, and $x_{jn}$ the nth item in the jth sample
($m, n = 1, 2, \ldots, r$). Also let $p_{im,jn} = 1$ if $x_{im} < x_{jn}$ and 0 if $x_{im} > x_{jn}$, and
define $p_{ij}$ by the relation

(10.15.2)  $p_{ij} = \sum_{m=1}^{r} \sum_{n=1}^{r} p_{im,jn}$

The greater the differences between the distribution functions $F_i(x)$ and $F_j(x)$,
the larger will tend to be the value of $p_{ij}$. If we then define S by

(10.15.3)  $S = 2\sum_{i<j} p_{ij} - \tfrac{1}{2}k(k - 1)r^2$

the statistic S may be used for testing $H_0$ against $H_1$. A large value of S will lead
to the rejection of $H_0$.

The following example is given by Jonckheere [15]. There are four
measurements on each of four samples:

Table 10.10

           I        II       III       IV
          19        21        40        49
          20        61        99       110
          60        80       100       151
         130       129       149       160
Mean   57.25     72.75     97.00    117.50

Considering Samples I and II, we note that the values 19 and 20 are each less
than four values in II, 60 is less than three values in II, and 130 is less than none.
Therefore,

$p_{12} = 4 + 4 + 3 = 11$

In the same way, we can calculate the other five values of $p_{ij}$ and so obtain

S = 2(11 + 12 + 13 + 11 + 12 + 12) - 96 = 46.

From the tables, Appendix B.10, we find that the probability that $S \geq 46$ is
0.0168, which would suggest rejection of $H_0$. The usual F-test on the same data
gives a probability of 0.346, which would lead to acceptance of $H_0$. The F-test,
however, has a much wider variety of alternative hypotheses than the Jonckheere
test.
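
The counting in this example can be checked mechanically. The following is a minimal Python sketch, not part of the original text: the function names `pair_count` and `jonckheere_s` are illustrative, and it assumes equal sample sizes and no tied observations, as in Table 10.10.

```python
from itertools import combinations

# Samples I-IV of Table 10.10 (one list per sample).
samples = [
    [19, 20, 60, 130],
    [21, 61, 80, 129],
    [40, 99, 100, 149],
    [49, 110, 151, 160],
]


def pair_count(a, b):
    """p_ij: number of pairs (x, y), x from a and y from b, with x < y."""
    return sum(1 for x in a for y in b if x < y)


def jonckheere_s(groups):
    """Return the table of p_ij (i < j) and S = 2*sum(p_ij) - k(k-1)r^2/2."""
    k = len(groups)
    r = len(groups[0])  # equal sample sizes are assumed
    p = {(i, j): pair_count(groups[i], groups[j])
         for i, j in combinations(range(k), 2)}
    s = 2 * sum(p.values()) - k * (k - 1) * r * r // 2
    return p, s


p, s = jonckheere_s(samples)
print(p)  # the six counts p_12, p_13, p_14, p_23, p_24, p_34 (0-based keys)
print(s)
```

With these data the six counts are 11, 12, 13, 11, 12, 12 and S comes out as 46, in agreement with the hand calculation.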

The quantity S is really a measure of the agreement of the ranking of all the
observations with the ranking that they would have if those in each separate
sample were tied. The ranks in the example above, along with the tied ranks,
are as shown in Table 10.11, where for each sample the first column gives the
actual rank among all 16 observations and the second column the ranks that
would be allotted if all the observations in a batch were tied. As usual when
dealing with ties, the rank is the arithmetic mean of the ranks that would apply
if the ties were broken; thus 2½ is the mean of 1, 2, 3 and 4 (see § 11.16).

Table 10.11

      I            II           III           IV
   1   2½       3   6½       4   10½       5   14½
   2   2½       7   6½       9   10½      11   14½
   6   2½       8   6½      10   10½      15   14½
  13   2½      12   6½      14   10½      16   14½

For each pair of observations we may now allot a score of 1 if they are in the
same order on both rankings, -1 if they are in opposite orders, or 0 if they are
tied on either ranking. The sum of these scores is the quantity S, and it is easily
checked that S = 46. The first observation in I and the first in II, for instance,
contribute 1, since 1 and 3 are in the same order as 2½ and 6½. This quantity
S is used in calculating rank correlation by Kendall's method (see §§ 11.14
and 11.17).
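
The pairwise scoring just described can be sketched in Python as a check on S = 46. This is an illustrative sketch only: the helper `sign` and the variable names are assumptions, and the raw observations are used in place of their ranks, which is harmless because only the order of each pair matters.

```python
from itertools import combinations

# Samples I-IV of Table 10.10; the sample index plays the role of the tied rank.
samples = [
    [19, 20, 60, 130],
    [21, 61, 80, 129],
    [40, 99, 100, 149],
    [49, 110, 151, 160],
]

# One (value, sample index) pair per observation.
observations = [(x, g) for g, sample in enumerate(samples) for x in sample]


def sign(d):
    """Return +1, 0 or -1 according to the sign of d."""
    return (d > 0) - (d < 0)


# +1 for a pair in the same order on both rankings, -1 for opposite orders,
# 0 when the pair is tied on either ranking (here: same sample).
score = sum(
    sign(x2 - x1) * sign(g2 - g1)
    for (x1, g1), (x2, g2) in combinations(observations, 2)
)
print(score)
```

Pairs drawn from the same sample contribute nothing, so the total equals the number of concordant cross-sample pairs minus the number of discordant ones.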

The statistic S has a symmetrical distribution with expectation zero and
variance

$\sigma^2 = \frac{N^2}{18}\left\{2N + 3 - \frac{2r + 3}{k}\right\}$

where N = kr. For large samples, $S/\sigma$ is approximately normal, especially if a
continuity correction is applied by subtracting 1 from S before dividing by $\sigma$.
A better approximation is the following:

(10.15.4)  $S\left[\frac{\nu}{(\nu + 1)\sigma^2 - S^2}\right]^{1/2} = t_\nu$

i.e., it has the Student-t distribution with $\nu$ degrees of freedom. The quantity $\nu$
depends on the kurtosis of the distribution of S and is given by

(10.15.5)  $\nu + 3 = \frac{1350\,\sigma^4}{N^3(6N^2 + 15N + 10) - kr^3(6r^2 + 15r + 10)}$

In the example, $\sigma = 21.42$ and the approximate z value is $45/\sigma = 2.10$,
which corresponds to a probability 0.018. The value of $\nu$ turns out to be 36 and
$t_\nu = 2.21$. The probability for a value of t exceeding 2.21 is 0.017, which is very
close to the correct value.
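
The arithmetic of the two approximations can be traced with a short Python sketch. It is offered under stated assumptions: the variance is taken as $N^2\{2N + 3 - (2r + 3)/k\}/18$, the t statistic as $S[\nu/((\nu + 1)\sigma^2 - S^2)]^{1/2}$, and $\nu + 3$ as $1350\sigma^4$ divided by $N^3(6N^2 + 15N + 10) - kr^3(6r^2 + 15r + 10)$, with the continuity-corrected value S - 1 = 45 used for S in both statistics. The variable names are illustrative.

```python
import math

k, r, s = 4, 4, 46          # four samples of four; S from the worked example
n = k * r                   # N = kr

# Variance and standard deviation of S under the null hypothesis.
variance = n**2 * (2 * n + 3 - (2 * r + 3) / k) / 18
sigma = math.sqrt(variance)

# Normal approximation with continuity correction (subtract 1 from S).
s_corrected = s - 1
z = s_corrected / sigma
p_normal = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail normal probability

# Degrees of freedom for the Student-t approximation.
denominator = n**3 * (6 * n**2 + 15 * n + 10) - k * r**3 * (6 * r**2 + 15 * r + 10)
nu = 1350 * variance**2 / denominator - 3

# Student-t statistic, again using the continuity-corrected S.
t = s_corrected * math.sqrt(nu / ((nu + 1) * variance - s_corrected**2))

print(round(sigma, 2), round(z, 2), round(p_normal, 3), round(nu, 1), round(t, 2))
```

Under these assumptions the sketch reproduces the quoted figures: $\sigma$ near 21.42, z near 2.10, $\nu$ near 36 and t near 2.21. The tail probability for t would still have to be read from a table of the t distribution.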

When there are only two samples, S reduces to $r^2 - 2U$, or $r^2 - 2U'$, where
U is the statistic used in the Mann-Whitney test (§ 10.12).


PROBLEMS 
A.  (§§10.1-10.5) 

1.  The  French  naturalist  Buffon  (1708-1788)  once  tossed  a  coin  4040  times  and 
obtained  2048  heads  and  1992  tails.  Show  that  this  result  is  not  at  all  surprising  with  a 
good  coin.  Hint:  Find  the  probability  of  a  discrepancy  from  the  expected  result  at 
least  as  great  as  this,  using  the  chi-square  test. 

2.  Over  a  period  of  time,  the  numbers  of  aircraft  accidents  that  occurred  on  the 
different  days  of  the  week  were  noted,  with  the  following  result : 


Day     M    Tu    W    Th    F    S    Sun
f_o    16     8   12    11    9   14     14

Is  there  good  reason  to  doubt  that  an  accident  is  equally  likely  to  occur  on  any  day 
of  the  week? 

3.  In  the  following  table,  x  is  the  number  of  5's  or  6's  observed  in  a  throw  with 
five  dice.  Is  this  result  of  243  throws  consistent  with  the  hypothesis  that  the  dice  are 
true? 


x       0    1    2    3    4    5
f_o    23   90   81   30   19    0

Hint:  Pool  the  last  two  classes. 

4.  A  student  dealt  26  cards  from  an  ordinary  deck  and  counted  the  number  of 
honor  cards  in  the  hand  dealt  (A,  K,  Q,  J,  and  10  counting  as  honors).  He  did  this  50 
times,  with  the  following  result : 


x       4     5     6     7     8     9     10    11    12    13    14    15    16    17
f_o     1     0     2     3     7     5     10     8     2     7     2     0     2     1
f_c   0.0   0.2   0.9   2.7   6.0   9.6   11.2   9.6   6.0   2.7   0.9   0.2   0.0   0.0

Would you reject the hypothesis that the cards were well shuffled between each deal?
Hint: The expected frequency of x honor cards, if the hypothesis is true, is
$50\binom{20}{x}\binom{32}{26 - x}\Big/\binom{52}{26}$, giving the theoretical frequencies $f_c$.
Pool the first four and the last five classes.
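
The theoretical frequencies in the table can be reproduced with a few lines of Python. This is a sketch, assuming the hypergeometric form $50\binom{20}{x}\binom{32}{26 - x}/\binom{52}{26}$ (20 honor cards and 32 others in the deck, 26 cards dealt, 50 deals); the function name `expected_frequency` is illustrative.

```python
from math import comb


def expected_frequency(x, deals=50):
    """Expected number of deals showing x honor cards among 26 cards dealt."""
    return deals * comb(20, x) * comb(32, 26 - x) / comb(52, 26)


# Theoretical frequencies f_c for x = 4, 5, ..., 17, rounded to one decimal.
f_c = {x: round(expected_frequency(x), 1) for x in range(4, 18)}
for x, value in f_c.items():
    print(x, value)
```

The values are symmetrical about x = 10 because exactly half the deck is dealt.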

5.  In  an  insecticide  test,  20  insects  were  put  into  each  of  100  jars  and  subjected  to  a 
standard  dose  of  insecticide.  The  number  surviving  (x)  after  three  hours  was  counted 
for  each  jar: 


x       0    1    2    3    4    5    6    7    8    9
f_o     3    8   11   15   16   14   12   11    9    1

Does this distribution appear to be binomial? Hint: If each insect has the same chance
$\theta$ of surviving, the distribution will be binomial. Estimate $\theta$ from the relation $20\theta = \bar{x}$,
and calculate the theoretical frequencies. The first two and the last two frequencies
should be pooled. Note that the last theoretical frequency is for x = 9 or more.

6.  Samples of 50 balls were taken repeatedly from a mixture of 100 red balls and
1100 white ones, and the number (x) of red balls in each sample was noted. The balls
were returned and well mixed after each sampling. In 300 trials the following values of
x were observed.

x       0    1    2    3    4    5    6    7    8    9    10 or more
f_o     1   16   36   48   62   51   41   22   18    5    0

Test the agreement of this distribution with a Poisson distribution of parameter $\mu = 25/6$.

7.  Calculate $\bar{x}$ (the mean of x) for the data of Problem 6, and fit the observations
with a Poisson distribution of parameter $\bar{x}$. Test the agreement now. Hint: When $\mu$
is estimated from the samples, the degrees of freedom are reduced by 1.

8.  A  classical  example  (originally  given  by  von  Bortkiewicz)  of  the  distribution  of 
rare  events  is  that  of  the  deaths  of  Prussian  cavalrymen  from  the  kicks  of  horses  during 
the  20  years  1875-1894.  The  frequency  distribution  of  such  deaths  in  10  army  corps, 
per  corps  per  annum,  was 


x        0    1    2    3    4
f_o    109   65   22    3    1

Fit  a  Poisson  distribution  and  test  the  goodness  of  fit. 

9.  The  following  table  gives  a  distribution  of  lengths  of  time  (in  seconds)  of 
telephone  calls  at  a  certain  exchange : 


Time        Number of Calls
0-99               1
100-199           28
200-299           88
300-399          180
400-499          247
500-599          260
600-699          133
700-799           42
800-899           11
900-999            5
Total            995

The  mean  and  standard  deviation  are  477.3  sec  and  145.7  sec  respectively.  Fit  a  normal 
curve  to  the  data  and  test  the  goodness  of  fit. 

10.  In  a  study  of  plant  disease  (spotted  wilt  of  tomatoes)  the  numbers  (x)  of  diseased 
plants  were  counted  in  each  of  160  groups  of  plants.  Each  group  contained  nine  plants, 
evenly  spaced,  so  that  x  could  take  integral  values  from  0  to  9  inclusive.  The  following 
distribution  was  found : 

x       0    1    2    3    4    5    6    7    8
f_o    36   48   38   23   10    3    1    1    0

Assuming that the probability of being diseased is constant ($\theta$), fit the distribution
with a binomial, estimating $\theta$ from the sample. Test the agreement by the chi-square
method, using three different procedures for pooling: (a) the last three frequencies,
(b) the last four, (c) the last five. (Note that, in this example, wider pooling tends to
disguise departures from the theoretical distribution.)

B.  (§§  10.6-10.8) 

1.  The  cumulative  frequencies  Fo  corresponding  to  given  values  of  x  (upper  class 
boundaries)  in  a  sampling  experiment,  are  shown  in  the  following  table.  The  theoretical 
cumulative  frequencies  Fc  from  a  certain  normal  population  are  also  given : 

Upper Class Boundary (x)     F_o      F_c
        20.5                   2       0.6
        22.5                   9       4.5
        24.5                  24      20.9
        26.5                  73      68.5
        28.5                 162     161.7
        30.5                 300     287.6
        32.5                 402     401.3
        34.5                 469     471.5
        36.5                 505     500.8
        38.5                 510     509.2
        40.5                 511     511.0

Use  the  Kolmogorov  test  to  show  that  it  is  reasonable  to  accept  the  hypothesis  that  the 
population  sampled  had  the  normal  distribution  corresponding  to  the  column  Fc. 

2.  The following table gives cumulative frequencies of correct responses to a
psychological test for (a) a group of 24 normal subjects, (b) a group of 24 schizophrenic
subjects. The test required the perception of groupings in a design exposed to the
subject for a variable time (Kaswan, British Journ. Psych., 1958, p. 131).

Exposure Time     F(Normal)     F(Schizophrenic)
0.01 sec               4                2
0.04                  10                4
0.1                   13                5
0.25                  17               10
0.75                  21               16
5.0                   24               21
0.0                   24               24

Use  the  Kolmogorov-Smirnov  two-sample  test  to  determine  whether  there  is  a  significant 
difference  between  the  two  groups. 

3.  Can  the  following  sample  be  reasonably  regarded  as  coming  from  a  uniform 
distribution  on  the  interval  (35,  70):  36,  42,  44,  50,  64,  58,  56,  50,  37,  48,  52,  63, 
57,  43,  39,  42,  47,  61,  53,  58?  Use  the  Kolmogorov  test.  Hint:  Calculate  theoretical 
relative  cumulative  frequencies,  x/35,  at  the  given  values  of  x. 

4.  Are  the  following  observations  (on  percentage  change  in  systolic  pressure  in 
dogs  under  experimental  conditions)  consistent  with  a  normal  distribution  of  mean 
16.0  and  variance  30.0? 

20.6,     11.6,      7.5,     10.5,     13.9,     16.2,     14.8 
17.8,    26.9,     13.5,    20.1,    22.5,     11.1,     16.7 

5.  Since F(x) in the Kolmogorov test is assumed to be continuous, there is zero
probability of two identical values of x. However, since measurements are made with
limited accuracy, ties do occur. What would be in general the effect of this on the value
of $D_n$?

C.  (§§ 10.9-10.12)

1.  The observed values $x_1$ and $x_2$ for two paired samples are given. Is it reasonable
to assume that no difference exists between the medians of the two populations from
which these samples were taken? Use the sign test.

x_1    15   19   31   36   10   11   19   15   10   16
x_2    19   30   26    8   10    6   17   13   22    8

2.  For  nine  animals,  tested  under  control  conditions  and  experimental  conditions, 
the  following  values  of  a  measured  variable  were  observed : 


Animal No.       1    2    3    4    5     6    7    8    9
Control         21   24   26   32   55    82   46   55   88
Experimental    18    9   23   26   82   199   42   30   62

Test  whether  a  significant  difference  exists  between  the  medians,  using  the  Wilcoxon 
signed-ranks  test. 

3.  The following table gives scores of a group of engineering students in (a)
mathematics, (b) graphics. Use the signed-ranks test to determine whether the median scores
of such students differ significantly in the two subjects.

Student No.   Maths   Graphics     Student No.   Maths   Graphics
     1          66       51             16         86       72
     2          20       51             17         65       73
     3          66       45             18         33       69
     4          73       77             19         42       51
     5          59       68             20         66       66
     6          58       51             21         59       82
     7          37       50             22         57       74
     8          85       81             23         27       44
     9          57       66             24         55       65
    10          69       65             25         61       65
    11          63       54             26         86       71
    12          75       53             27         52       65
    13          87       73             28         79       62
    14          34       40             29         63       62
    15          67       73             30         75       64

4.  In  a  comparative  study  of  the  effects  of  oxygen  on  the  peripheral  nerve  in  cats 
and  rabbits,  the  survival  time  of  a  nerve  under  anoxic  conditions  was  measured.  The 
times  (in  minutes)  are  given: 


Cat  No. 


Time 


33 


25 


Rabbit No.    1    2    3    4    5    6    7    8    9   10   11   12   13   14
Time         35   35   30   30   28   28   23   22   22   20   17   16   16   15

Test the hypothesis that these are random samples from populations with the same
distribution, as against the alternative hypothesis that the cat times are stochastically
larger than the rabbit times.

5.  In  the  following  table  the  variable  is  the  number  of  trials  required  by  a  rat  to 
learn  a  new  pattern  of  behavior  when  placed  in  a  new  situation.  The  experimental 
group  of  rats  had  been  trained  in  a  certain  way;  the  control  group  had  not.  Test 
whether  the  previous  training  significantly  affects  the  ability  to  learn. 


Experimental  Group 

78 

64 

75 

45 

82 

54   71 

Control  Group 

110 

70 

53 

51 

62 

93  106 

88 

67 

72 

D.  (§§  10.13-10.15) 

1.  The  annual  marriage  rates  per  1000  of  population  in  the  United  States  for  1885, 
1890,  1895  . .  .  1950  are:  9.2,  9.0,  8.9,  9.3,  10.0,  10.3,  10.0,  12.0,  10.3,  9.2,  10.4, 
12.1,  12.2,  11.1.  Does  there  appear  to  be  a  significant  upward  trend  ?  (Use  the  number 
of  positive  differences  and  also  the  total  number  of  runs.) 

2.  Apply the approximate chi-square test on runs of a given length to test the
randomness of the time series in Problem D-1.

3.  A  student  opened  a  set  of  mathematical  tables  with  the  entries  blocked  off  in 
sets  of  five,  and,  starting  at  random,  added  the  five  terminal  digits  in  each  block  of  five 
numbers.  Going  consecutively  through  50  blocks,  he  obtained  the  following  values : 

12,  15,  18,  30,  33,  25,  28,  22,  23,  17 

25,  18,  22,  13,  17,  18,  22,  25,  27,  30 

28,  32,  24,  27,  20,  22,  15,  18,  20,  23 

12,  15,  27,  30,  33,  25,  28,  20,  23,  17 

25,  18,  20,  13,  17,  18,  22,  33,  27,  30 

Test  the  sequence  for  randomness,  using  runs  up  and  down. 

4.  Snedecor (Statistical Methods, Iowa State College Press, 1956) has given the
following table showing the amounts of four different fats (in grams) absorbed by
doughnuts, six batches of doughnuts being used for each fat.

            Fat
    A      B      C      D
   164    178    175    155
   172    191    193    166
   168    197    178    149
   177    182    171    164
   195    177    176    168
   156    185    163    170

Test for a significant difference between the means by Jonckheere's method.
Hint: First order the samples in increasing order of their means. Count ½ for ties in
computing $p_{ij}$. Use the normal approximation.

5.  The  following  table  is  supposed  to  represent  the  scores  of  samples  from  three 
different  groups  of  teachers  on  a  certain  personality  test.  Does  there  seem  to  be  a  real 
ordering  of  the  groups  ? 
Group I     Group II     Group III
   96           82           115
  128          124           149
   83          132           166
   61          135           147
  101          109           129
Chapter 11

DISTRIBUTIONS  OF   PAIRS   OF  VARIATES 

11.1  The  Classical  Regression  Problem  for  a  Population  We  now  propose  to 
investigate  some  measures  of  the  relationship  between  two  random  variables 
(variates),  X  and  Y,  both  of  which  are  capable  of  measurement  on  each  member 
of  a  given  population.  For  convenience  we  shall  think  of  them  as  distributed 
continuously,  but  the  extension  to  discrete  distributions  will  usually  be  obvious 
— a  matter  of  replacing  probability  densities  by  probabilities  and  integrals  by 
sums. 

In general, the joint probability that for a particular member of the population
the value of X lies between x and x + dx and the value of Y lies between y and
y + dy is given by f(x, y) dx dy. The probability density for X alone (regardless
of Y) is

(11.1.1)  $g(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$

and that for Y alone is

(11.1.2)  $h(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$

The probability density of Y, for a given value x of X, is

(11.1.3)  $f(y|x) = f(x, y)/g(x)$

and similarly for $f(x|y)$. The expectation of Y, for given X, is defined as

(11.1.4)  $\eta_x = E(Y|X) = \int_{-\infty}^{\infty} y f(y|x)\,dy$

It is, of course, a function of the given value x of X. The graph of $\eta_x$ as a
function of x is called the true regression curve of Y on X. If the regression is
linear the graph is a straight line, with the equation

(11.1.5)  $\eta_x = \alpha + \beta x$

The parameter $\beta$ is the true regression coefficient of Y on X, and is the slope of
the straight line. The other parameter $\alpha$ represents the intercept on the axis of Y.
(See Fig. 47.)

The variance of Y, for given X, is similarly defined by

(11.1.6)  $\sigma_{Y|x}^2 = V(Y|X) = \int_{-\infty}^{\infty} (y - \eta_x)^2 f(y|x)\,dy$

This is also in general a function of x, although in some circumstances it turns
out to be independent of x. It is the variance of Y for those members of the
population whose X-values lie in a thin strip of width dx; these members are
said to form an X-array. They are represented by dots in Figure 47.

Fig. 47    Linear regression in a population (the line shown is $\eta_x = \alpha + \beta x$)

The regression parameters $\alpha$ and $\beta$ may be expressed in terms of the moments
of the distributions of X and Y. Let us define the two means by

(11.1.7)  $\mu_X = \int_{-\infty}^{\infty} g(x)\,x\,dx = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x f(x, y)\,dy\,dx$

          $\mu_Y = \int_{-\infty}^{\infty} h(y)\,y\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y f(x, y)\,dx\,dy$

and the two variances and the covariance by

(11.1.8)  $\sigma_X^2 = \int_{-\infty}^{\infty} g(x)(x - \mu_X)^2\,dx$

          $\sigma_Y^2 = \int_{-\infty}^{\infty} h(y)(y - \mu_Y)^2\,dy$

          $\sigma_{XY} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)(x - \mu_X)(y - \mu_Y)\,dy\,dx$

From Eqs. (3) and (4) it follows that

(11.1.9)  $g(x)\,\eta_x = \int_{-\infty}^{\infty} y f(x, y)\,dy$

so that, on writing $\eta_x = \alpha + \beta x$ and integrating,

(11.1.10)  $\int_{-\infty}^{\infty} (\alpha + \beta x)\,g(x)\,dx = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y f(x, y)\,dy\,dx$

or, using Eq. (7),

(11.1.11)  $\alpha + \beta\mu_X = \mu_Y$

If we multiply Eq. (9) by x before integrating, we obtain

(11.1.12)  $\int_{-\infty}^{\infty} (\alpha x + \beta x^2)\,g(x)\,dx = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy f(x, y)\,dy\,dx$

which may be written

(11.1.13)  $\alpha\mu_X + \beta\mu'_{2X} = \mu_{XY}$

where $\mu'_{2X}$ is the second moment about the origin for X, and $\mu_{XY}$ is the product
moment about the origin for X and Y. Multiplying Eq. (11) by $\mu_X$ and
subtracting it from Eq. (13), we obtain

$\beta(\mu'_{2X} - \mu_X^2) = \mu_{XY} - \mu_X\mu_Y$

which is equivalent to

(11.1.14)  $\beta\sigma_X^2 = \sigma_{XY}$

We have, therefore,

(11.1.15)  $\beta = \sigma_{XY}/\sigma_X^2 = \rho_{XY}\,\sigma_Y/\sigma_X$

where $\rho_{XY} = \sigma_{XY}/(\sigma_X\sigma_Y)$ is the Pearson coefficient of correlation between X and Y.
From Eq. (11),

(11.1.16)  $\alpha = \mu_Y - \beta\mu_X$

so that the two parameters of the regression line are now expressed in terms of
the means, variances and covariance of X and Y. Using these expressions, the
equation of the line may be written

(11.1.17)  $\eta_x - \mu_Y = \beta(x - \mu_X) = \rho_{XY}\,\sigma_Y\,\frac{x - \mu_X}{\sigma_X}$

which indicates that the line passes through the point with coordinates $(\mu_X, \mu_Y)$.
A similar equation may be obtained by interchanging X and Y. If $\xi_y$ is the
expectation of X, for a given value y of Y,

(11.1.18)  $\xi_y - \mu_X = \rho_{XY}\,\sigma_X\,\frac{y - \mu_Y}{\sigma_Y}$

This line also passes through the point $(\mu_X, \mu_Y)$ but in general it does not
coincide with the first line. There are therefore two regression lines, one of Y
on X and one of X on Y, given respectively by Eqs. (17) and (18). The first line
represents the expectation of Y for a given X, and the second line the
expectation of X for a given Y.
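
As a small illustration of Eqs. (15) and (16), the following Python sketch recovers the slope and intercept of a linear regression from the moments of a discrete joint distribution. The distribution is hypothetical and chosen only so that the regression of Y on X is exactly linear, with intercept 1 and slope 2; sums replace the integrals, and exact fractions avoid rounding.

```python
from fractions import Fraction as F

# Hypothetical marginal distribution of X on the points 0, 1, 2.
g = {0: F(1, 4), 1: F(1, 2), 2: F(1, 4)}

# Given X = x, Y is 1 + 2x - 1 or 1 + 2x + 1 with equal probability,
# so the true regression line is eta_x = 1 + 2x.
joint = {}
for x, gx in g.items():
    for deviation in (-1, 1):
        joint[(x, 1 + 2 * x + deviation)] = gx / 2


def expect(fn):
    """Expectation of fn(x, y) over the joint distribution."""
    return sum(prob * fn(x, y) for (x, y), prob in joint.items())


mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))

beta = cov_xy / var_x          # Eq. (15)
alpha = mu_y - beta * mu_x     # Eq. (16)
print(alpha, beta)
```

The moments give back the slope and intercept that were built into the conditional means, and the fitted line passes through the point of means.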
The variance of Y, for a given X, may vary from one value of X to another,
but we can define a weighted average of the variances in the different X-arrays
by the relation

(11.1.19)  $\sigma_{Ye}^2 = \int_{-\infty}^{\infty} \sigma_{Y|x}^2\,g(x)\,dx$

the variance $\sigma_{Y|x}^2$ for a given x being weighted with the probability density for
this value of x. The quantity $\sigma_{Ye}^2$ is called the variance of estimate of Y. It is a
measure of the average variability of Y around the regression line of Y on X.
Using Eqs. (3) and (6) and noting that

$(y - \eta_x)^2 = \{y - \mu_Y - \beta(x - \mu_X)\}^2$
$\qquad = (y - \mu_Y)^2 + \beta^2(x - \mu_X)^2 - 2\beta(y - \mu_Y)(x - \mu_X)$

we obtain, from Eq. (19),

(11.1.20)  $\sigma_{Ye}^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} [(y - \mu_Y)^2 + \beta^2(x - \mu_X)^2 - 2\beta(y - \mu_Y)(x - \mu_X)]\,f(x, y)\,dx\,dy$
$\qquad = \sigma_Y^2 + \beta^2\sigma_X^2 - 2\beta\sigma_{XY}$

From Eq. (15), $\beta^2\sigma_X^2 = \beta\sigma_{XY} = \rho_{XY}^2\sigma_Y^2$, so that

(11.1.21)  $\sigma_{Ye}^2 = \sigma_Y^2(1 - \rho_{XY}^2)$

It is clear from this relation, since $\sigma_{Ye}^2$ and $\sigma_Y^2$ are necessarily non-negative,
that $\rho_{XY}^2$ cannot exceed 1. In other words, for all possible distributions of X and Y,

(11.1.22)  $-1 \leq \rho_{XY} \leq 1$


As previously indicated (§ 2.13), $\rho_{XY}$ is a measure of the degree of relationship
between X and Y. If X and Y are independent, $\rho_{XY} = 0$. If Y is precisely
proportional to X, so that all the Y values lie on the straight regression line, then
$\sigma_{Ye} = 0$ and $\rho_{XY} = \pm 1$. These are the extreme cases.
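
The relation (11.1.21) can be checked on a small discrete example in which the array variances differ from one value of x to another. The sketch below is illustrative only: the joint distribution is hypothetical, built so that the regression of Y on X is linear, and sums replace the integrals.

```python
from fractions import Fraction as F

# Hypothetical marginal distribution of X and a spread that grows with x.
g = {0: F(1, 4), 1: F(1, 2), 2: F(1, 4)}
spread = {0: 1, 1: 2, 2: 3}

# Given X = x, Y is 1 + 2x - spread[x] or 1 + 2x + spread[x], each with
# probability 1/2, so the array variance at x is spread[x] squared.
joint = {}
for x, gx in g.items():
    for sgn in (-1, 1):
        joint[(x, 1 + 2 * x + sgn * spread[x])] = gx / 2


def expect(fn):
    """Expectation of fn(x, y) over the joint distribution."""
    return sum(prob * fn(x, y) for (x, y), prob in joint.items())


mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))
rho_squared = cov_xy**2 / (var_x * var_y)

# Weighted average of the array variances, as in Eq. (19).
var_estimate = sum(gx * spread[x] ** 2 for x, gx in g.items())

print(var_estimate, var_y * (1 - rho_squared))
```

Both printed quantities equal 9/2 for this distribution, although the individual array variances are 1, 4 and 9.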


11.2  The Bivariate Normal Surface    An important special case of a
two-variate distribution is that with a joint probability density:

(11.2.1)  $f(x, y) = Ke^{-Q}$

where

(11.2.2)  $K^{-1} = 2\pi\sigma_X\sigma_Y(1 - \rho^2)^{1/2}$

and

(11.2.3)  $Q = \left[\left(\frac{x - \mu_X}{\sigma_X}\right)^2 + \left(\frac{y - \mu_Y}{\sigma_Y}\right)^2 - 2\rho\left(\frac{x - \mu_X}{\sigma_X}\right)\left(\frac{y - \mu_Y}{\sigma_Y}\right)\right] \div [2(1 - \rho^2)]$

Here $\rho$ has been written for $\rho_{XY}$. The quantity Q is a quadratic form in the
standardized variates,

(11.2.4)  $z = \frac{x - \mu_X}{\sigma_X}, \quad v = \frac{y - \mu_Y}{\sigma_Y}$

so that the probability density in terms of the variates z and v is

(11.2.5)  $g(z, v) = (2\pi)^{-1}(1 - \rho^2)^{-1/2}\exp\left[-\frac{z^2 + v^2 - 2\rho zv}{2(1 - \rho^2)}\right]$

This represents a surface known as the bivariate normal surface, and pictured
(in a truncated form) in Figure 48. It is bell-shaped, asymptotic in all directions
to the z-v plane. Sections parallel to the z-v plane are ellipses and sections
parallel to either the g-z plane or the g-v plane are normal curves.

Fig. 48    Bivariate normal surface

The distributions of z and v separately are given by integrating Eq. (5). Thus,

(11.2.6)  $g(z) = \int_{-\infty}^{\infty} g(z, v)\,dv = (2\pi)^{-1/2}e^{-z^2/2}$

(11.2.7)  $h(v) = \int_{-\infty}^{\infty} g(z, v)\,dz = (2\pi)^{-1/2}e^{-v^2/2}$

These are both standard normal distributions. The probability density for v,
for a given z, is

(11.2.8)  $g(v|z) = \frac{g(z, v)}{g(z)} = [2\pi(1 - \rho^2)]^{-1/2}\exp\left[-\frac{(v - \rho z)^2}{2(1 - \rho^2)}\right]$

The expectation of v for a given z is therefore

$\int_{-\infty}^{\infty} v\,g(v|z)\,dv = \rho z$

This represents a straight line through the origin with slope $\rho$. The regression is
therefore linear, and this holds also for the second regression line, with equation
$E(z|v) = \rho v$.

Each array of v's has, for given z, the variance

$\sigma_{v|z}^2 = \int_{-\infty}^{\infty} (v - \rho z)^2 g(v|z)\,dv$

and, on carrying out the integration, this becomes

(11.2.9)  $\sigma_{v|z}^2 = 1 - \rho^2$

Fig. 49    Horizontal section of bivariate normal surface

Transforming back to the original variates X and Y, we obtain

(11.2.10)  $\sigma_{Y|x}^2 = \sigma_Y^2(1 - \rho^2)$

and the variance is therefore independent of x. The weighted average $\sigma_{Ye}^2$ is
then the same as $\sigma_{Y|x}^2$, for all x. A distribution with this property is called
"homoscedastic" (from Greek words meaning "equal scattering"). A similar
property holds for the Y-arrays of X for given values of y.
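
The linear regression and constant array variance of the standardized bivariate normal density can be confirmed numerically. The sketch below is only a check by trapezoidal integration, assuming the density of Eq. (5); the function names, the integration range and the chosen values $\rho = 0.6$, z = 1.3 are illustrative.

```python
import math


def g_joint(z, v, rho):
    """Standardized bivariate normal density g(z, v) with correlation rho."""
    q = (z * z + v * v - 2 * rho * z * v) / (2 * (1 - rho * rho))
    return math.exp(-q) / (2 * math.pi * math.sqrt(1 - rho * rho))


def conditional_moments(z, rho, half_width=12.0, steps=24000):
    """Marginal density g(z), and mean and variance of v given z,
    obtained by trapezoidal integration over v."""
    h = 2 * half_width / steps
    vs = [-half_width + i * h for i in range(steps + 1)]
    weights = [g_joint(z, v, rho) for v in vs]
    weights[0] *= 0.5   # trapezoidal end-point weights
    weights[-1] *= 0.5
    marginal = sum(weights) * h
    mean = sum(w * v for w, v in zip(weights, vs)) * h / marginal
    variance = sum(w * (v - mean) ** 2 for w, v in zip(weights, vs)) * h / marginal
    return marginal, mean, variance


rho, z = 0.6, 1.3
marginal, mean, variance = conditional_moments(z, rho)
print(marginal, mean, variance)
```

For these values the conditional mean should be close to $\rho z = 0.78$ and the conditional variance close to $1 - \rho^2 = 0.64$, whatever value of z is chosen.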

In the z-v plane the two regression lines have the same slope, one with
respect to Oz and the other with respect to Ov (Figure 49). A section of the
bivariate normal surface by a horizontal plane, g = const., is an ellipse with
equation

(11.2.11)  $z^2 + v^2 - 2\rho zv = c$

If this ellipse is drawn on the z-v plane, the tangents at the points where it is
cut by the regression lines are parallel to the axes.

The  bivariate  normal  distribution  occupies  a  central  position  in  the  theory 
of  two-variate  distributions,  similar  to  that  of  the  normal  distribution  for  a  single 
variate.  Most  of  the  early  classical  work  of  Karl  Pearson,  Galton  and  others,  on 
regression  and  correlation,  was  based  on  this  distribution. 

11.3  Linear  Regression  as  determined  from  a  Sample  In  practice,  there  is  a 
variety  of  situations  involving  two  variables.  Both  X  and  Y  may  be  random,  or 
only  one  of  them,  or  neither.  One  variable,  for  instance,  may  be  the  time,  as  in  a 
time-series  of  temperatures,  stock-market  prices,  sunspot  numbers,  etc.  Also,  in 
many  cases,  the  values  of  X  are  pre-selected  instead  of  being  chosen  at  random, 
as  in  a  physics  experiment  where  conveniently  chosen  weights  are  hung  on  a 
wire  or  spring  and  the  extension  produced  is  measured.  Here  neither  X  nor  Y 
is  a  random  variable  in  the  ordinary  sense,  but  both  may  be  subject  in  different 
degrees to experimental error. The true relation between X and Y is a functional
one.  Obviously,  unless  X  is  a  random  variable  it  makes  no  sense  to  speak  of  its 
distribution,  expectation  or  variance,  and  of  course  the  same  holds  for  Y. 

In  the  usual  regression  problem,  Y  is  a  random  variable  and  X  may  be  either 
random  or  fixed.  The  classical  situation  is  that  in  which  both  are  random,  a 
sample  being  selected  randomly  from  a  population  such  as  that  considered  in 
§  11.1,  and  the  values  of  X  and  Y  measured  on  each  of  the  N  selected  items.  We 
assume that for a given value x of X, Y is of the form

(11.3.1)  $y = \alpha + \beta x + \epsilon$

where $\epsilon$ is normally distributed with mean 0 and variance $\sigma^2$. The relation

(11.3.2)  $Y = \alpha + \beta X$

which really means

(11.3.3)  $E(Y|X = x) = \alpha + \beta x$

is  sometimes  called  a  structural  relation.  It  is  the  underlying  relationship  which 
is  disturbed,  in  an  actual  sample,  by  the  sampling  fluctuation  of  Y  and  by  the 
errors  of  measurement  of  both  X  and  Y. 

We will assume for the present that X and Y are measured without
appreciable error, so that we need consider only the sampling fluctuation expressed by
the quantity $\epsilon$ in Eq. (1). If a straight line is fitted to the N sample values of X
and Y, this line will furnish estimates of the parameters $\alpha$ and $\beta$ of the true
regression line. In fact, as we shall see shortly, when the line is fitted by the
usual "least squares" method these estimates are unbiased.
Suppose the observed pairs of sample values are $(x_i, y_i)$, $i = 1, 2, \ldots, N$. The
problem is to find the equation of a straight line which will give the "best"
estimate of Y for a given value x of X and to find the standard error of this
estimate.

The observed sample values, plotted in the x-y plane, form a scatter
diagram (Fig. 50). The least-squares method of fitting a straight line chooses
the constants a and b of the equation

(11.3.4)  $y_c = a + bx$

in such a way that the sum of squares of the deviations of the sample points from
this line, measured parallel to the y-axis, will be a minimum. In Figure 50 the
deviation of $y_i$ from the least-squares line is denoted by $e_i$, and the deviation from
the true regression line $\eta = \alpha + \beta x$ by $\epsilon_i$.

Fig. 50    Sample linear regression

If $S = \sum_i e_i^2 = \sum_i (y_i - a - bx_i)^2$, the minimum value of S will be given by
solving the simultaneous equations $\partial S/\partial a = 0$, $\partial S/\partial b = 0$. These may be written,
after cancelling a factor -2,

(11.3.5)  $\sum_i (y_i - a - bx_i) = 0$
          $\sum_i x_i(y_i - a - bx_i) = 0$

which are called the normal equations of the problem. They can be rearranged as

(11.3.6)  $Na + \sum x_i\,b = \sum y_i$
          $\sum x_i\,a + \sum x_i^2\,b = \sum x_i y_i$

Solving for a and b, we obtain

(11.3.7)  $b = s_{XY}/s_X^2$

(11.3.8)  $a = \bar{y} - b\bar{x}$

where $\bar{x} = \sum x_i/N$, $\bar{y} = \sum y_i/N$, $(N - 1)s_X^2 = \sum x_i^2 - N\bar{x}^2$, and
$(N - 1)s_{XY} = \sum x_i y_i - N\bar{x}\bar{y}$. Here $s_X^2$ is the sample variance of X and $s_{XY}$ is the
sample covariance of X and Y.

The  Pearson  coefficient  of  correlation  for  the  sample  is  defined  by 

(11.3.9)  $r_{XY} = \dfrac{s_{XY}}{s_X s_Y} = b\,\dfrac{s_X}{s_Y}$

so that the equation of the least-squares line may be written, using Eqs. (7) and
(8), as

(11.3.10)  $y_c - \bar{y} = b(x - \bar{x}) = r\,\dfrac{s_Y}{s_X}(x - \bar{x})$

where r stands for $r_{XY}$.

For any given value x of X, $y_c$ provides an estimate of the corresponding value
of Y.

All the above argument can be carried through with X and Y interchanged.
If we wish to find the best straight line for estimating X for a given value y of Y,
the least squares criterion for the deviations of the sample points from this line,
measured parallel to the x-axis, will give

(11.3.11)  $x_c - \bar{x} = b'(y - \bar{y}) = r\,\dfrac{s_X}{s_Y}(y - \bar{y})$

where

(11.3.12)  $b' = \dfrac{s_{XY}}{s_Y^2}$

It may be noted that $bb' = r^2$, so that r is the geometric mean of the two slopes
(one measured from the x-axis, one from the y-axis). The two regression lines
intersect at the point whose coordinates are $(\bar{x}, \bar{y})$.

In  most  practical  situations  one  of  the  two  variates  will,  for  non-statistical 
reasons,  be  the  one  we  would  like  to  estimate,  and  this  is  the  one  we  label  Y.  It 
is  therefore  hardly  necessary  to  treat  the  second  regression  line  in  detail.  What 
is  said  in  subsequent  paragraphs  about  the  regression  of  Y  on  X  can  be  applied, 
with  minor  changes,  to  the  regression  of  X  on  Y. 
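The results of this section can be checked numerically. The following is a minimal sketch, assuming a small made-up data set (the values of `xs` and `ys` and the function name `fit_lines` are illustrative and not taken from the text). It computes a, b, b' and r from the sample variances and covariance, and then confirms that the residuals satisfy the two normal equations and that bb' = r^2.

```python
from math import sqrt

def fit_lines(xs, ys):
    """Return (a, b, b_prime, r) using Eqs. (11.3.7)-(11.3.12)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs) / (n - 1)                     # s_X^2
    syy = sum((y - ybar) ** 2 for y in ys) / (n - 1)                     # s_Y^2
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)  # s_XY
    b = sxy / sxx              # slope of the regression of Y on X
    a = ybar - b * xbar        # intercept, Eq. (11.3.8)
    b_prime = sxy / syy        # slope of the regression of X on Y
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return a, b, b_prime, r

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, b_prime, r = fit_lines(xs, ys)
residuals = [y - a - b * x for x, y in zip(xs, ys)]

print(sum(residuals))                             # first normal equation, zero up to rounding
print(sum(x * e for x, e in zip(xs, residuals)))  # second normal equation, zero up to rounding
print(b * b_prime, r ** 2)                        # the two values agree up to rounding
```

The variances are formed with the divisor N - 1, as in the text, although the divisor cancels in b, b' and r.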

11.4  Computation of the Regression and Correlation Coefficients  Calculation
of b and r requires, in effect, the determination of two variances and a
covariance, since $b = s_{XY}/s_X^2$ and $r = s_{XY}/(s_X s_Y)$. The most convenient formulas
for ungrouped variates are the following:

(11.4.1)  $b = \dfrac{N\sum xy - \sum x \sum y}{N\sum x^2 - (\sum x)^2}$

(11.4.2)  $r^2 = \dfrac{(N\sum xy - \sum x \sum y)^2}{[N\sum x^2 - (\sum x)^2][N\sum y^2 - (\sum y)^2]}$

If, for the sake of simplifying the calculations, we put $u = (x - x_0)/h$ and
$v = (y - y_0)/k$, where $x_0$, $y_0$, h and k are chosen arbitrarily, it will make no
difference to the value of r if we replace x and y by u and v throughout Eq. (2).
The value of b, however, obtained by using u and v in Eq. (1) must be multiplied
by k/h to give the value in terms of x and y.

Example 1  Specimens of steels containing various percentages of nickel
were tested for toughness with the following results (X is toughness in arbitrary
units, Y is percentage of nickel):

X |  47   50   52   52   54   56   58   59   60   60   62   64   65   66
Y | 2.5  2.7  2.8  2.8  2.9  3.2  3.2  3.3  3.4  3.5  3.5  3.6  3.7  3.8

Suppose it is desired to estimate percentage of nickel from measured toughness
in further specimens and to estimate the correlation between these variables.
It is assumed that a random sample of nickel-steel alloys was selected for
testing, and both X and Y were measured on each specimen.

If we let $u = X - 50$ and $v = 10(Y - 3.0)$, we get Table 11.1. Then,

Table 11.1

    u      v    u^2    v^2     uv
   -3     -5      9     25     15
    0     -3      0      9      0
    2     -2      4      4     -4
    2     -2      4      4     -4
    4     -1     16      1     -4
    6      2     36      4     12
    8      2     64      4     16
    9      3     81      9     27
   10      4    100     16     40
   10      5    100     25     50
   12      5    144     25     60
   14      6    196     36     84
   15      7    225     49    105
   16      8    256     64    128
  ---------------------------------
  105     29   1235    275    525

$N\sum uv - \sum u \sum v = 14(525) - 105(29) = 4305$
$N\sum u^2 - (\sum u)^2 = 14(1235) - (105)^2 = 6265$
$N\sum v^2 - (\sum v)^2 = 14(275) - (29)^2 = 3009$

Since h = 1 and k = 0.1, we have

$b = 0.1 \times \dfrac{4305}{6265} = 0.0687$

$r^2 = \dfrac{(4305)^2}{6265 \times 3009} = 0.9831$

$r = 0.9915$

$\bar{x} = 50 + \dfrac{105}{14} = 57.5$

$\bar{y} = 3.0 + 0.1 \times \dfrac{29}{14} = 3.207$

so that for any given x

$y_c = 3.207 + 0.0687(x - 57.5)$

This is the relation required.
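The arithmetic of Example 1 can be reproduced in a few lines. The sketch below is only an illustration of Eqs. (11.4.1) and (11.4.2) with the coded variables u and v; the variable names are assumptions made for this example. The coded v values are rounded to integers because the subtraction of decimal fractions is not exact in floating point.

```python
from math import sqrt

# Data of Example 1: toughness X and percentage of nickel Y.
X = [47, 50, 52, 52, 54, 56, 58, 59, 60, 60, 62, 64, 65, 66]
Y = [2.5, 2.7, 2.8, 2.8, 2.9, 3.2, 3.2, 3.3, 3.4, 3.5, 3.5, 3.6, 3.7, 3.8]

# Coding constants: u = (x - x0)/h, v = (y - y0)/k.
x0, h = 50, 1
y0, k = 3.0, 0.1

u = [(x - x0) / h for x in X]
v = [round((y - y0) / k) for y in Y]   # integers -5, -3, ..., 8

N = len(X)
su, sv = sum(u), sum(v)
suu = sum(a * a for a in u)
svv = sum(c * c for c in v)
suv = sum(a * c for a, c in zip(u, v))

num = N * suv - su * sv        # 4305
den_u = N * suu - su ** 2      # 6265
den_v = N * svv - sv ** 2      # 3009

b = (k / h) * num / den_u      # slope converted back to the x, y scale
r = num / sqrt(den_u * den_v)  # r is unaffected by the coding
xbar = x0 + h * su / N
ybar = y0 + k * sv / N

print(num, den_u, den_v)
print(round(b, 4), round(r, 4), xbar, round(ybar, 3))  # 0.0687 0.9915 57.5 3.207
```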

Example 2  When the sample to be studied is large, it is usually convenient
to replace the scatter diagram by a two-way frequency table, the frequency
in any cell of the table being the number of individuals in the sample
falling within the corresponding class intervals for both X and Y. In Table
11.2, X represents the grade achieved on a mental test by applicants for a certain
type of industrial job, and Y the productive ability of these applicants after
hiring (measured as a percentage of a certain standard of production). The
auxiliary variables u and v are here defined as $u = (x - 42.5)/5$, $v = (y - 85)/10$.
The values of x and y shown in the table headings are the centers of the
class-intervals. The marginal column totals are denoted by $f_u$ (the frequency for a
given value of u), and the marginal row totals are similarly denoted by $f_v$.
Clearly,

$\sum_u f_u = \sum_v f_v = N = 260$

which is also the sum of all the frequencies in all the cells in the main body of the
table (the part surrounded by a double line). The values of $f_v$ are added vertically
and the values of $f_u$ horizontally.

The rows headed $u f_u$ and $u^2 f_u$ are obtained by multiplying each $f_u$ by the
corresponding u and then multiplying the product again by u. Similarly for the
columns headed $v f_v$ and $v^2 f_v$. These rows and columns give the means and
variances of u and v.

$\bar{u} = \dfrac{\sum u f_u}{N} = \dfrac{-17}{260} = -0.06538$

$\bar{v} = \dfrac{\sum v f_v}{N} = \dfrac{60}{260} = 0.23077$

$(N - 1)s_u^2 = \sum u^2 f_u - \dfrac{(\sum u f_u)^2}{N} = 735 - \dfrac{289}{260} = 733.89$

$(N - 1)s_v^2 = \sum v^2 f_v - \dfrac{(\sum v f_v)^2}{N} = 802 - \dfrac{3600}{260} = 788.15$

The means and variances of X and Y are

$\bar{x} = 42.5 + 5\bar{u} = 42.2$
$\bar{y} = 85 + 10\bar{v} = 87.3$

$s_X^2 = 25 s_u^2 = 70.84$
$s_Y^2 = 100 s_v^2 = 304.3$

There are various methods in use for obtaining the covariance $s_{uv}$. One method,
which has the advantage of providing convenient checks on the calculations, is
indicated in the table.

Table 11.2

    x       22.5  27.5  32.5  37.5  42.5  47.5  52.5  57.5 |
    y  v\u    -4    -3    -2    -1     0     1     2     3 |   f_v  v f_v  v^2 f_v     U    vU
  125    4                             2     3     2       |     7     28      112     7    28
  115    3                 1     3     1     4     4     4 |    17     51      153    19    57
  105    2                 5     7     8    11     8     7 |    46     92      184    31    62
   95    1           2     1    10    12     9     8     2 |    44     44       44    13    13
   85    0     1     3    12    11     7    12     7     1 |    54      0        0   -19     0
   75   -1     2     1     5     6    16     8     5       |    43    -43       43    -9     9
   65   -2     2     5     5     8     8     6     1       |    35    -70      140   -33    66
   55   -3     2     3     3     4     1     1             |    14    -42      126   -26    78
  ------------------------------------------------------------------------------------------
f_u            7    14    32    49    55    54    35    14 |   260     60      802   -17   313
u f_u        -28   -42   -64   -49     0    54    70    42 |   -17   (checks)
u^2 f_u      112   126   128    49     0    54   140   126 |   735
V            -12   -18   -10    -1     4    32    37    28 |    60
uV            48    54    20     1     0    32    74    84 |   313

The row headed V is obtained by multiplying each cell-frequency
in a given column by the value of v corresponding to that cell and
adding along the column. Thus, for the first column, $V = 1(0) + 2(-1)
+ 2(-2) + 2(-3) = -12$. Each value of V is then multiplied by the
corresponding u to give the row headed uV.

A similar procedure is used for the columns U and vU. For the first row
$U = 2(0) + 3(1) + 2(2) = 7$, and for this row v = 4, so that vU = 28. The
checks are

$\sum V = \sum v f_v = 60$
$\sum U = \sum u f_u = -17$
$\sum uV = \sum vU = 313$

From the method of calculation it is clear that $\sum uV$ or $\sum vU$ gives the same
result as we might have obtained by multiplying each individual cell-frequency
by its own u and v, and adding over all the cells of the table. That is, it gives the
quantity $\sum f uv$ which we need in calculating the covariance. In fact,

$(N - 1)s_{uv} = \sum f uv - \dfrac{(\sum u f_u)(\sum v f_v)}{N} = 313 - \dfrac{(-17)(60)}{260} = 316.92$

so that $s_{XY} = 50 s_{uv} = 61.18$.

The regression line of Y on X is given by $y_c - \bar{y} = b(x - \bar{x})$, where
$b = 61.18/70.84 = 0.864$.

The coefficient of correlation between X and Y is given by

$r^2 = \dfrac{(61.18)^2}{(70.84)(304.3)} = 0.1737$

so that

r = 0.417
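The grouped-data calculation of Example 2 can be sketched as follows. The frequency matrix is an assumption in one respect: the positions of the empty cells are not legible in every copy of Table 11.2, so they have been placed here so that every marginal total and every value of U and V quoted in the example is reproduced. The names `freq`, `u_vals` and `v_vals` are illustrative.

```python
from math import sqrt

u_vals = [-4, -3, -2, -1, 0, 1, 2, 3]   # coded X, columns of the table
v_vals = [4, 3, 2, 1, 0, -1, -2, -3]    # coded Y, rows of the table

# Cell frequencies, one row per value of v (0 marks an empty cell).
freq = [
    [0, 0, 0, 0, 2, 3, 2, 0],
    [0, 0, 1, 3, 1, 4, 4, 4],
    [0, 0, 5, 7, 8, 11, 8, 7],
    [0, 2, 1, 10, 12, 9, 8, 2],
    [1, 3, 12, 11, 7, 12, 7, 1],
    [2, 1, 5, 6, 16, 8, 5, 0],
    [2, 5, 5, 8, 8, 6, 1, 0],
    [2, 3, 3, 4, 1, 1, 0, 0],
]

N = sum(map(sum, freq))
f_u = [sum(row[j] for row in freq) for j in range(len(u_vals))]  # column totals
f_v = [sum(row) for row in freq]                                 # row totals

S_u = sum(u * f for u, f in zip(u_vals, f_u))
S_v = sum(v * f for v, f in zip(v_vals, f_v))
S_uu = sum(u * u * f for u, f in zip(u_vals, f_u))
S_vv = sum(v * v * f for v, f in zip(v_vals, f_v))
S_uv = sum(v * u * f for v, row in zip(v_vals, freq) for u, f in zip(u_vals, row))

var_u = (S_uu - S_u ** 2 / N) / (N - 1)
var_v = (S_vv - S_v ** 2 / N) / (N - 1)
cov_uv = (S_uv - S_u * S_v / N) / (N - 1)

h, k = 5, 10                  # class widths used in the coding
s_x2 = h * h * var_u
s_y2 = k * k * var_v
s_xy = h * k * cov_uv

b = s_xy / s_x2
r = s_xy / sqrt(s_x2 * s_y2)
print(N, S_u, S_v, S_uu, S_vv, S_uv)   # 260 -17 60 735 802 313
print(round(b, 3), round(r, 3))        # 0.864 0.417
```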

11.5  Variance about the Regression Line  When the values of a and b
are chosen according to the equations of (11.3.5), the minimum value of S is
given by

(11.5.1)  $S_{\min} = \sum_i e_i^2 = \sum_i [y_i - \bar{y} - b(x_i - \bar{x})]^2$
          $\quad = \sum_i (y_i - \bar{y})^2 + b^2 \sum_i (x_i - \bar{x})^2 - 2b \sum_i (x_i - \bar{x})(y_i - \bar{y})$
          $\quad = (N - 1)(s_Y^2 + b^2 s_X^2 - 2b\,s_{XY})$

where $s_X^2$, $s_Y^2$, and $s_{XY}$ are the sample variances of X and Y and the sample
covariance, respectively. Using Eq. (11.3.9), we may write this

(11.5.2)  $S_{\min} = (N - 1)s_Y^2(1 - r^2)$

which may be compared with (11.1.21) and suggests that $S_{\min}/(N - 1)$ is an
estimator of $\sigma_{Ye}^2$. It turns out, however, that a better (unbiased) estimator is
$S_{\min}/(N - 2)$.

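Corresponding to the sample value $x_i$ we have three values of Y, namely, the
actual sample value $y_i$, the estimated value $y_{ci}$ and the true expected value $\eta_i$,
these being connected by the relations

(11.5.3)  $y_i = \eta_i + \varepsilon_i = \alpha + \beta x_i + \varepsilon_i$
          $\quad = y_{ci} + e_i = a + b x_i + e_i$

Now

$(N - 1)s_{XY} = \sum_i (y_i - \bar{y})(x_i - \bar{x}) = \sum_i y_i (x_i - \bar{x})$

since $\sum_i (x_i - \bar{x}) = 0$. Writing $y_i = \alpha + \beta x_i + \varepsilon_i = \alpha + \beta(x_i - \bar{x}) + \beta\bar{x} + \varepsilon_i$,
we obtain $(N - 1)s_{XY} = \beta \sum_i (x_i - \bar{x})^2 + \sum_i \varepsilon_i (x_i - \bar{x})$ since the other terms
vanish. Therefore $(N - 1)s_{XY} = \beta(N - 1)s_X^2 + \sum_i \varepsilon_i (x_i - \bar{x})$ so that, on
dividing by $(N - 1)s_X^2$, $s_{XY}/s_X^2 = \beta + \sum_i \varepsilon_i (x_i - \bar{x})/[(N - 1)s_X^2]$. We have then

(11.5.4)  $b = \beta + \varepsilon_b$

where

(11.5.5)  $\varepsilon_b = \dfrac{\sum_i \varepsilon_i (x_i - \bar{x})}{(N - 1)s_X^2}$

which may be regarded as the "error" or sampling fluctuation of b. Since we
assumed that $E(\varepsilon_i) = 0$ for all i, it follows that $E(\varepsilon_b) = 0$ and therefore b is an
unbiased estimator of $\beta$. We have also assumed that the variance of $\varepsilon_i$ is the same
($\sigma^2$) for all i; in other words, that Y is homoscedastic. If so, we can write

(11.5.6)  $V(b) = V(\varepsilon_b) = \sigma^2 \sum_i \left[\dfrac{x_i - \bar{x}}{(N - 1)s_X^2}\right]^2 = \dfrac{\sigma^2}{(N - 1)s_X^2}$

Moreover, if $\varepsilon_i$ is normally distributed for each i, so is b, with expectation $\beta$ and
variance $\sigma^2/[(N - 1)s_X^2]$.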
From Eq. (3),

(11.5.7)  $\varepsilon_i - e_i = (a - \alpha) + (b - \beta)x_i$

Also, from the first equation of (11.3.5), $\sum e_i = 0$, so that, on summing Eq.
(7) over i and dividing by N, we obtain

(11.5.8)  $a - \alpha + (b - \beta)\bar{x} = \bar\varepsilon$

From Eqs. (7) and (4),

(11.5.9)  $\varepsilon_i - e_i = \bar\varepsilon + \varepsilon_b(x_i - \bar{x})$

Now, for any fixed value x,

$y_c - \eta = a - \alpha + (b - \beta)x$
$\quad = \bar\varepsilon + \varepsilon_b(x - \bar{x})$
$\quad = \sum_i \varepsilon_i \left[\dfrac{1}{N} + \dfrac{(x - \bar{x})(x_i - \bar{x})}{(N - 1)s_X^2}\right]$

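Since the $\varepsilon_i$ are independent, with mean 0 and common variance $\sigma^2$, it follows
that

(11.5.10)  $E(y_c) = \eta$

and

(11.5.11)  $V(y_c) = \sigma^2 \sum_i \left[\dfrac{1}{N} + \dfrac{(x - \bar{x})(x_i - \bar{x})}{(N - 1)s_X^2}\right]^2$
           $\quad = \sigma^2 \left[\dfrac{1}{N} + \dfrac{(x - \bar{x})^2}{(N - 1)s_X^2}\right]$

since $\sum_i (x_i - \bar{x})^2 = (N - 1)s_X^2$.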

This expression contains the unknown variance $\sigma^2$, which we need to estimate.
The minimum sum of squares $S_{\min}$, by Eq. (9), may be written

(11.5.12)  $S_{\min} = \sum_i e_i^2 = \sum_i [\varepsilon_i - \bar\varepsilon - \varepsilon_b(x_i - \bar{x})]^2$
           $\quad = \sum_i \varepsilon_i^2 - N\bar\varepsilon^2 + (N - 1)s_X^2\varepsilon_b^2 - 2\varepsilon_b \sum_i \varepsilon_i (x_i - \bar{x})$
           $\quad = \sum_i \varepsilon_i^2 - N\bar\varepsilon^2 - (N - 1)s_X^2\varepsilon_b^2$

Now $\varepsilon_i/\sigma$ is by hypothesis a standard normal variate, and the sum of N squares
of such variates is distributed as $\chi^2$ with N degrees of freedom. Also, $N^{1/2}\bar\varepsilon/\sigma$
is a standard normal variate, and its square is distributed as $\chi^2$ with 1 d.f.
Finally, $(N - 1)^{1/2}s_X\varepsilon_b/\sigma$ is also a standard normal variate and its square is a $\chi^2$
variate with 1 d.f. These last two normal variates are both linear functions of
the $\varepsilon_i/\sigma$ and it is easily verified that they are orthogonal (see Appendix A.10). It
follows from Fisher's Theorem (§ 4.7) that $S_{\min}/\sigma^2$ is distributed as $\chi^2$ with N - 2
degrees of freedom and is independent of the last two terms on the right-hand
side of (12).

From the known properties of the chi-square distribution,

$E\left(\dfrac{S_{\min}}{\sigma^2}\right) = N - 2$

or

(11.5.13)  $E\left(\dfrac{S_{\min}}{N - 2}\right) = \sigma^2$

An unbiased estimator of $\sigma^2$ is therefore

(11.5.14)  $\hat\sigma^2 = \dfrac{S_{\min}}{N - 2} = \dfrac{N - 1}{N - 2}\,s_Y^2(1 - r^2)$

and on substituting this for $\sigma^2$ in Eq. (11) we obtain for the estimated variance
of $y_c$,

(11.5.15)  $\hat{V}(y_c) = \hat\sigma^2 \left[\dfrac{1}{N} + \dfrac{(x - \bar{x})^2}{(N - 1)s_X^2}\right]$

This is a function of x, having its least value when $x = \bar{x}$.

We may be interested in the variance of Y about the estimated regression line,
that is, in the variance of $Y - y_c$ for some new assumed value x of X. Since the
variance of $y_c$ depends only on the N observations already made and the new
observation is independent of these, we may write

(11.5.16)  $V(Y - y_c \mid x) = V(Y \mid x) + V(y_c)$
           $\quad = \sigma^2 \left[1 + \dfrac{1}{N} + \dfrac{(x - \bar{x})^2}{(N - 1)s_X^2}\right]$

and an estimate of this variance is given by putting $\hat\sigma^2$, from Eq. (14), in place of
$\sigma^2$. The square root of $\hat{V}(Y - y_c)$ is called the standard error of estimate.
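As an illustration of Eqs. (11.5.14) to (11.5.16), the sketch below computes the unbiased estimate of the variance about the regression line and the standard error of estimate for the nickel-steel data of Example 1. The function names `var_fitted` and `std_error_of_estimate` are assumptions introduced for this example.

```python
from math import sqrt

# Data of Example 1.
X = [47, 50, 52, 52, 54, 56, 58, 59, 60, 60, 62, 64, 65, 66]
Y = [2.5, 2.7, 2.8, 2.8, 2.9, 3.2, 3.2, 3.3, 3.4, 3.5, 3.5, 3.6, 3.7, 3.8]

N = len(X)
xbar = sum(X) / N
ybar = sum(Y) / N
Sxx = sum((x - xbar) ** 2 for x in X)                       # (N - 1) s_X^2
Syy = sum((y - ybar) ** 2 for y in Y)                       # (N - 1) s_Y^2
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))    # (N - 1) s_XY

b = Sxy / Sxx
a = ybar - b * xbar

# Minimum sum of squares and the unbiased variance estimate, Eq. (11.5.14).
S_min = sum((y - a - b * x) ** 2 for x, y in zip(X, Y))
sigma2_hat = S_min / (N - 2)

def var_fitted(x):
    """Estimated variance of y_c at the value x, Eq. (11.5.15)."""
    return sigma2_hat * (1.0 / N + (x - xbar) ** 2 / Sxx)

def std_error_of_estimate(x):
    """Square root of the estimated variance of Y - y_c, Eq. (11.5.16)."""
    return sqrt(sigma2_hat * (1.0 + 1.0 / N + (x - xbar) ** 2 / Sxx))

r2 = Sxy ** 2 / (Sxx * Syy)
print(S_min, Syy * (1.0 - r2))          # the two agree, as in Eq. (11.5.2)
print(std_error_of_estimate(xbar), std_error_of_estimate(66))
```

The second print shows the standard error growing as x moves away from the sample mean, which is the reason the confidence belt widens toward its ends.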

11.6  Confidence Limits for the Parameters of Regression and for Estimated Y

Since the slope b of the regression line (on the hypotheses stated above) is a
normal variate with expectation $\beta$ and variance $\sigma^2/[(N - 1)s_X^2]$, and since an
independent unbiased estimate of $\sigma^2$ is furnished by $\hat\sigma^2$, with N - 2 degrees of
freedom, the quantity $(b - \beta)(N - 1)^{1/2}s_X/\hat\sigma$ has the Student-t distribution.
Using Eq. (11.5.14) we can write the 100(1 - $\alpha$)% confidence limits for $\beta$ as

(11.6.1)  $\beta = b \pm t_\alpha \dfrac{s_Y}{s_X}\left(\dfrac{1 - r^2}{N - 2}\right)^{1/2}$

where $t_\alpha$ is the value of t exceeded numerically with probability $\alpha$. This may also
be written

(11.6.2)  $\beta = b \pm t_\alpha \dfrac{b}{r}\left(\dfrac{1 - r^2}{N - 2}\right)^{1/2}$

The variance of a is $\sigma^2\left[\dfrac{1}{N} + \dfrac{\bar{x}^2}{(N - 1)s_X^2}\right]$, and confidence limits for the true
regression parameter $\alpha$ may be found in a similar manner. However, it is
seldom important to know $\alpha$. In most regression problems it is the slope that
matters.

Example 3  For a sample of size 27, we find that b = 0.163, r = 0.582.
What are the 95% confidence limits for $\beta$?

For 25 d.f., $t_{0.05} = 2.060$. Therefore the limits are

$\beta = 0.163 \pm 2.060(0.0456)$
$\quad = 0.069$ and $0.257$

If $\beta = 0$ (and therefore $\rho = 0$), the quantity

$t = r\left(\dfrac{N - 2}{1 - r^2}\right)^{1/2}$

has the Student-t distribution with N - 2 d.f. This is sometimes useful in deciding
whether an observed value of r differs significantly from 0.

Fig. 51  Confidence belt for estimate from linear regression

Example 4  For a sample of size 27, suppose that r = 0.348. Is this
significantly different from zero?

Here t = 1.856, with 25 d.f. The probability of a value numerically as great
as this is between 0.05 and 0.1, so that at the 5% level of significance the answer
to the question is "No." It takes a fairly large value of r to be significant with a
sample size as small as 27.
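The test statistic of Example 4 follows directly from the formula $t = r[(N - 2)/(1 - r^2)]^{1/2}$. A minimal sketch, in which the critical value 2.060 for 25 d.f. is the tabulated figure already quoted in Example 3 and the function name is an assumption:

```python
from math import sqrt

def t_from_r(r, n):
    """Student-t statistic for testing rho = 0 from a sample correlation r."""
    return r * sqrt((n - 2) / (1.0 - r ** 2))

t = t_from_r(0.348, 27)
t_critical = 2.060           # 5% two-sided value for 25 d.f., from tables

print(round(t, 3))           # 1.856
print(abs(t) > t_critical)   # False: not significant at the 5% level
```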

Since the expectation of $Y - y_c$ is zero and its variance is given by Eq.
(11.5.16), it follows that

$\dfrac{Y - y_c}{\sigma\left[1 + \dfrac{1}{N} + \dfrac{(x - \bar{x})^2}{(N - 1)s_X^2}\right]^{1/2}}$

where $\hat\sigma^2$ is substituted for $\sigma^2$, has the t-distribution with N - 2 d.f. The
100(1 - $\alpha$)% confidence limits for Y for a given x are therefore

(11.6.3)  $Y = y_c \pm t_\alpha s_{Ye}$

where $y_c = a + bx$ and

$s_{Ye}^2 = \hat\sigma^2\left[1 + \dfrac{1}{N} + \dfrac{(x - \bar{x})^2}{(N - 1)s_X^2}\right]$
$\quad = \dfrac{N - 1}{N - 2}\,s_Y^2(1 - r^2)\left[1 + \dfrac{1}{N} + \dfrac{(x - \bar{x})^2}{(N - 1)s_X^2}\right]$

The curves bounding the confidence belt are hyperbolas (Figure 51). For
large N, $s_{Ye}^2 \approx s_Y^2(1 - r^2)$, and the belt is almost of uniform width.


11.7  Regression when the Variable X Is Not Random  As mentioned in
§ 11.3, it often happens in practice that the values of X in an experiment are
pre-selected, so as to be convenient numbers instead of being chosen at random.
Sometimes, also, the circumstances of observation practically dictate the values
of X, so that the observer has very little choice in the matter. The problem is
then not really one concerning two variates, but rather it concerns a single
variate Y which depends on a mathematical variable x. The assumption is that

(11.7.1)  $Y = \alpha + \beta x + \varepsilon_x$

where the $\varepsilon_x$ are normal, with expectation zero and a common variance $\sigma^2$ for
all x, and, for different values of x, are independent of one another. With this
assumption (see § 11.8) the maximum likelihood estimators of $\alpha$ and $\beta$ turn out
to be the a and b of § 11.3. Since, however, X is now not a random variable, we
must understand by $s_X^2$ not the estimator of the true variance of X but merely a
symbol for the quantity $\sum (x_i - \bar{x})^2/(N - 1)$, where $\bar{x} = \sum x_i/N$.

As we have seen earlier, when X and Y are both random there are in general
two distinct regression lines, one for estimating Y for given values of X, and the
other for estimating X for given values of Y. The former is still good when X is
pre-selected and not random (the least squares argument of § 11.3 is still valid)
but the latter has no meaning in this case.

It sometimes happens that we would like to estimate X for a given value of
Y even though X is not random. In testing an insecticide, for example, we may
wish to estimate the median lethal dose (the dose that will kill 50% of the time)
from observations made on the proportions of insects killed with various known
doses. The method is to invert the regression equation $y_c = a + bx$ and write

(11.7.2)  $x = \dfrac{y_c - a}{b}$

where $y_c$ is the given value of Y (see Figure 51). The 100(1 - $\alpha$)% confidence
limits for x are $x_1$ and $x_2$, these being the abscissae of the points where the line
$Y = y_c$ cuts the confidence band. The values of $x_1$ and $x_2$ can be calculated
from (11.6.3) by putting $y_c = a + bx \pm t_\alpha s_{Ye}$ and treating this equation as a
quadratic in x. The two roots are the required limits, provided that b is
sufficiently large. For values of b too small to be significantly different from zero
at the level $\alpha$, it may happen that no finite confidence interval for x exists, but
this case is not of much practical importance.
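The inversion described above can be sketched as follows. Squaring $y_0 - a - bx = \pm t_\alpha s_{Ye}$ and writing $d = x - \bar{x}$ gives a quadratic $A d^2 + B d + C = 0$ whose coefficients are formed in the code; this algebra, the function name `inverse_limits`, the target value $y_0 = 3.0$ and the use of the Example 1 data are all choices made for this illustration. The value 2.179 is the tabulated 5% two-sided point of t for 12 d.f.

```python
from math import sqrt

def inverse_limits(xs, ys, y0, t_alpha):
    """Point estimate and confidence limits for x at a given value y0 of Y.

    Returns (x_hat, x_lo, x_hi). The limits are None when the slope is too
    small, relative to its error, for a finite interval to exist.
    """
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)                      # (N - 1) s_X^2
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = ybar - b * xbar
    s_min = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    sigma2 = s_min / (n - 2)                                    # Eq. (11.5.14)
    x_hat = (y0 - a) / b                                        # Eq. (11.7.2)

    # Quadratic in d = x - xbar from (y0 - ybar - b d)^2 = t^2 s_Ye^2.
    g = t_alpha ** 2 * sigma2
    A = b * b - g / sxx
    B = -2.0 * b * (y0 - ybar)
    C = (y0 - ybar) ** 2 - g * (1.0 + 1.0 / n)
    if A <= 0:
        return x_hat, None, None
    root = sqrt(B * B - 4.0 * A * C)
    d1 = (-B - root) / (2.0 * A)
    d2 = (-B + root) / (2.0 * A)
    return x_hat, xbar + min(d1, d2), xbar + max(d1, d2)

X = [47, 50, 52, 52, 54, 56, 58, 59, 60, 60, 62, 64, 65, 66]
Y = [2.5, 2.7, 2.8, 2.8, 2.9, 3.2, 3.2, 3.3, 3.4, 3.5, 3.5, 3.6, 3.7, 3.8]

x_hat, x_lo, x_hi = inverse_limits(X, Y, y0=3.0, t_alpha=2.179)
print(x_hat, x_lo, x_hi)
```

The condition `A <= 0` corresponds to the case mentioned in the text in which b is not significantly different from zero and no finite interval exists.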

* 11.8  Maximum Likelihood Estimation of the Regression Coefficients  We
consider first the case when X is "fixed" (pre-selected), Y being given by Eq.
(11.7.1), with the assumptions there stated. If $\varepsilon_i$ denotes the value of $\varepsilon_x$ when
$x = x_i$, $i = 1, 2, \ldots, N$, the joint distribution of the N values $\varepsilon_i$ has the density
function

(11.8.1)  $g(\varepsilon_1, \ldots, \varepsilon_N) = (2\pi\sigma^2)^{-N/2} \exp\left[-\dfrac{\sum_i \varepsilon_i^2}{2\sigma^2}\right]$

Let the observed values of Y corresponding to the fixed values $x_i$ be denoted
by $y_i$. The joint density function for the $y_i$ is $f(y_1, y_2, \ldots, y_N)$, where

(11.8.2)  $f(y_1, y_2, \ldots, y_N)\,dy_1\,dy_2 \ldots dy_N = g(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N)\,d\varepsilon_1 \ldots d\varepsilon_N$

The $y_i$ are related to the $\varepsilon_i$ by equations

(11.8.3)  $y_i = \alpha + \beta x_i + \varepsilon_i$,  $i = 1, 2, \ldots, N$

and the Jacobian of the transformation from the $\varepsilon_i$ to the $y_i$, for fixed $x_i$, is equal
to 1. It follows that

(11.8.4)  $f(y_1, y_2, \ldots, y_N) = (2\pi\sigma^2)^{-N/2} \exp\left[-\dfrac{\sum_i (y_i - \alpha - \beta x_i)^2}{2\sigma^2}\right]$

The principle of maximum likelihood suggests that we choose $\alpha$ and $\beta$ to
maximize this function $f(y_1, \ldots, y_N)$. This is equivalent to minimizing
$\sum_i (y_i - \alpha - \beta x_i)^2$, and on differentiating partially with respect to $\alpha$ and $\beta$ and
equating the derivatives to zero, we arrive at the following equations for the
maximum likelihood estimators $\hat\alpha$ and $\hat\beta$:

(11.8.5)  $\sum_i (y_i - \hat\alpha - \hat\beta x_i) = 0$
          $\sum_i x_i (y_i - \hat\alpha - \hat\beta x_i) = 0$

These are identical with the equations of (11.3.5) so that $\hat\alpha = a$, $\hat\beta = b$. We get
the same result as by the method of least squares.
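The equivalence can be illustrated numerically: for any fixed $\sigma^2$, the logarithm of the density (11.8.4) is larger at the least-squares values than at nearby values of the two coefficients. The sketch below assumes made-up data and an arbitrary $\sigma^2 = 1$; neither comes from the text, and the function names are illustrative.

```python
from math import log, pi

def log_likelihood(alpha, beta, sigma2, xs, ys):
    """Logarithm of the joint density (11.8.4) for fixed x values."""
    n = len(xs)
    ss = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * log(2.0 * pi * sigma2) - ss / (2.0 * sigma2)

def least_squares(xs, ys):
    """Intercept a and slope b from Eqs. (11.3.7) and (11.3.8)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

# Hypothetical data and an arbitrary fixed variance, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 2.9, 5.1, 7.2, 8.8, 11.1]
sigma2 = 1.0

a, b = least_squares(xs, ys)
best = log_likelihood(a, b, sigma2, xs, ys)

# Evaluate the log-likelihood on a small grid of perturbed coefficients.
steps = (-0.1, 0.0, 0.1)
neighbours = [log_likelihood(a + da, b + db, sigma2, xs, ys)
              for da in steps for db in steps if (da, db) != (0.0, 0.0)]

print(best > max(neighbours))   # True
```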

The above argument does not hold if X is a random variable. If the joint
probability density for X and Y is $f(x, y)$, the likelihood for the observed sample
is

(11.8.6)  $L = f(x_1, y_1) f(x_2, y_2) \ldots f(x_N, y_N)$

If we transform to new variates U and $\varepsilon$ by the relations

(11.8.7)  $X = U$
          $Y = \alpha + \beta U + \varepsilon$

the Jacobian of the transformation is 1, and therefore Eq. (6) becomes

(11.8.8)  $L = f(u_1, \alpha + \beta u_1 + \varepsilon_1) f(u_2, \alpha + \beta u_2 + \varepsilon_2) \ldots f(u_N, \alpha + \beta u_N + \varepsilon_N)$

If this splits into two factors, one of which depends on the $\varepsilon_i$ alone and the
other on the $u_i$ alone, we must have

$f(u, \alpha + \beta u + \varepsilon) = g(u) \cdot h(\varepsilon)$

or, equivalently,

(11.8.9)  $f(x, y) = g(x) \cdot h(y - \alpha - \beta x)$

If and only if this condition is satisfied, we can feel justified in estimating $\alpha$ and $\beta$
from the distribution $h(\varepsilon)$ of $\varepsilon$ alone. When this distribution is normal we get
back to the equations (5). The condition (9) is satisfied, for example, if the
population is bivariate normal. Using the standardized variates z and v, we see
from Eqs. (11.2.5), (11.2.6) and (11.2.8) that

(11.8.10)  $g(z, v) = f(z) \cdot g(v \mid z)$
           $\quad = \left[(2\pi)^{-1/2} e^{-z^2/2}\right]\left[(2\pi)^{-1/2}(1 - \rho^2)^{-1/2} \exp\left(-\dfrac{(v - \rho z)^2}{2(1 - \rho^2)}\right)\right]$

The first factor is a function of z alone and the second of $v - \rho z$ alone. In these
variates $v - \rho z$ is equivalent to $y - \alpha - \beta x$.

It should be observed that when condition (9) is satisfied the regression is
necessarily linear. For

(11.8.11)  $E(Y \mid X = x) = \int y\,h(y - \alpha - \beta x)\,dy$
           $\quad = \int (y - \alpha - \beta x)\,h(y - \alpha - \beta x)\,dy + (\alpha + \beta x)\int h(y - \alpha - \beta x)\,dy$
           $\quad = E(\varepsilon) + (\alpha + \beta x)$

since $h(\varepsilon)$ is a density function.

The first term can be taken as zero by suitably adjusting $\alpha$, and we thus have
the ordinary equation for linear regression.

11.9  Functional Relation Between Variables Subject to Error  It often happens
that the variables X and Y, whether "fixed" or random, are subject to
experimental error. With random variables this error is mixed up with the fluctuation
due to the underlying probability distribution. We will therefore suppose to
begin with that, apart from the error, X and Y are fixed and that Y is a linear
function of X. That is,

(11.9.1)  $X = \xi + u$
          $Y = \eta + v$

where

(11.9.2)  $\eta = \alpha + \beta\xi$

The quantities $\xi$ and $\eta$ are regarded as the true values of the variables, and u and
v as the errors. We suppose that u and v are uncorrelated with each other, and,
for any value of $\xi$, are distributed normally with means zero and variances $\sigma_u^2$
and $\sigma_v^2$ respectively.

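The joint density function for X and Y is then

(11.9.3)  $f(x, y) = (2\pi\sigma_u\sigma_v)^{-1} \exp\left[-\dfrac{(x - \xi)^2}{2\sigma_u^2} - \dfrac{(y - \alpha - \beta\xi)^2}{2\sigma_v^2}\right]$

The likelihood function for a set of N pairs of observed values $(x_i, y_i)$ is

(11.9.4)  $L = (2\pi\sigma_u\sigma_v)^{-N} \exp\left[-\dfrac{1}{2\sigma_u^2}\sum_i (x_i - \xi_i)^2 - \dfrac{1}{2\sigma_v^2}\sum_i (y_i - \alpha - \beta\xi_i)^2\right]$

so that

(11.9.5)  $\log L = C - N\log\sigma_u - N\log\sigma_v - \dfrac{1}{2\sigma_u^2}\sum_i (x_i - \xi_i)^2 - \dfrac{1}{2\sigma_v^2}\sum_i (y_i - \alpha - \beta\xi_i)^2$

The right-hand side contains N + 4 unknown parameters, namely $\alpha$, $\beta$, $\sigma_u$,
$\sigma_v$ and the N values $\xi_i$. The maximum likelihood equations, found by
differentiating partially with respect to each of these parameters and setting the
derivatives equal to zero, are

(11.9.6)  $\sum_i (y_i - \alpha - \beta\xi_i) = 0$
(11.9.7)  $\sum_i \xi_i (y_i - \alpha - \beta\xi_i) = 0$
(11.9.8)  $N\sigma_u^2 = \sum_i (x_i - \xi_i)^2$
(11.9.9)  $N\sigma_v^2 = \sum_i (y_i - \alpha - \beta\xi_i)^2$
(11.9.10)  $(x_i - \xi_i)/\sigma_u^2 + \beta(y_i - \alpha - \beta\xi_i)/\sigma_v^2 = 0$,  $i = 1, 2, \ldots, N$

From Eqs. (8) and (10) we find

$N\sigma_v^4 = \beta^2\sigma_u^2 \sum_i (y_i - \alpha - \beta\xi_i)^2$

and, on substituting from Eq. (9),

(11.9.11)  $\beta^2 = \dfrac{\sigma_v^2}{\sigma_u^2}$

Since this relation cannot be supposed to hold in general, the maximum
likelihood method is not satisfactory unless some further assumption is made. The
most convenient one to make is that the ratio of $\sigma_v^2$ to $\sigma_u^2$ is definitely known.
If this ratio is denoted by $\lambda$, and if we let $\sigma_u^2 = \sigma^2$, $\sigma_v^2 = \lambda\sigma^2$, we obtain,
instead of Eqs. (8) and (9), the one relation

(11.9.12)  $2N\lambda\sigma^2 = \lambda \sum_i (x_i - \xi_i)^2 + \sum_i (y_i - \alpha - \beta\xi_i)^2$

and instead of Eq. (10)

(11.9.13)  $\lambda(x_i - \xi_i) + \beta(y_i - \alpha - \beta\xi_i) = 0$

Substituting from Eq. (13) in Eqs. (6) and (7), we find, after some rearrangement
of terms, that

(11.9.14)  $N\alpha + \beta \sum x_i = \sum y_i$

and

(11.9.15)  $\alpha \sum x_i + \beta \sum x_i^2 = \sum x_i y_i + \dfrac{\beta}{\lambda}\left(\sum y_i^2 - \alpha \sum y_i - \beta \sum x_i y_i\right)$

By eliminating $\alpha$ from these two equations we obtain a quadratic equation in
$\beta$, which reduces to

(11.9.16)  $s_{XY}\beta^2 + (\lambda s_X^2 - s_Y^2)\beta - \lambda s_{XY} = 0$

where $s_X^2$, $s_Y^2$ and $s_{XY}$ are the variances and covariance of the observed sample
values $x_i$ and $y_i$. The estimator of $\beta$ is

(11.9.17)  $\hat\beta = \{s_Y^2 - \lambda s_X^2 + [(s_Y^2 - \lambda s_X^2)^2 + 4\lambda s_{XY}^2]^{1/2}\}/(2s_{XY})$

The estimator of $\alpha$ is found from Eq. (14) and that of $\sigma^2$ from Eq. (12). It turns
out that

(11.9.18)  $\hat\sigma^2 = \dfrac{N - 1}{4N\lambda}\{s_Y^2 + \lambda s_X^2 - [(s_Y^2 - \lambda s_X^2)^2 + 4\lambda s_{XY}^2]^{1/2}\}$

Unfortunately, as Lindley [1] has shown, this is not a consistent estimator of $\sigma^2$.
It tends in probability, as N increases, to the value $\sigma^2/2$.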

If we use the least squares method, it is no longer correct, as in the classical
regression problem, to minimize the sum of squares of deviations parallel to the
y-axis. Since the values of X observed are no longer the true values, the variance
of X must be taken into consideration. One way is to minimize the sum of squares
of the X deviations, weighted inversely as $\sigma_u^2$, plus the sum of squares of the Y
deviations, weighted inversely as $\sigma_v^2$. That is, we minimize

$\lambda \sum_i (x_i - \xi_i)^2 + \sum_i (y_i - \alpha - \beta\xi_i)^2$

with respect to $\alpha$, $\beta$ and the $\xi_i$. Another method is to weight the squared
deviation of $y_i$ from $\alpha + \beta x_i$ with a weight inversely proportional to the variance of
$y_i - \alpha - \beta x_i$. Since this variance is $\sigma_v^2 + \beta^2\sigma_u^2 = \sigma^2(\lambda + \beta^2)$, the expression
to be minimized is

$\sum_i \dfrac{(y_i - \alpha - \beta x_i)^2}{\lambda + \beta^2}$

Differentiating partially with respect to $\alpha$ and $\beta$ ($\lambda$ is supposed known, as in the
maximum likelihood method), we obtain

(11.9.19)  $\bar{y} = \hat\alpha + \hat\beta\bar{x}$

and

(11.9.20)  $\hat\beta \sum_i (y_i - \hat\alpha - \hat\beta x_i)^2 + (\lambda + \hat\beta^2) \sum_i x_i (y_i - \hat\alpha - \hat\beta x_i) = 0$

If we imagine the origin shifted to the point $(\bar{x}, \bar{y})$, which will not affect the slope
of the line, we can put $\hat\alpha = 0$, from Eq. (19), and then

(11.9.21)  $\hat\beta \sum_i (y_i - \hat\beta x_i)^2 + (\lambda + \hat\beta^2) \sum_i x_i (y_i - \hat\beta x_i) = 0$

which can be written

$(\lambda - \hat\beta^2) \sum x_i y_i + \hat\beta \sum (y_i^2 - \lambda x_i^2) = 0$

With the new origin, $\sum x_i^2$, $\sum y_i^2$, and $\sum x_i y_i$ are proportional to $s_X^2$, $s_Y^2$ and
$s_{XY}$ respectively, so that

(11.9.22)  $(\lambda - \hat\beta^2)s_{XY} + \hat\beta(s_Y^2 - \lambda s_X^2) = 0$

which is the same as Eq. (16). The estimator $\hat\beta$ is therefore the same as that
furnished by the method of maximum likelihood, but now a consistent estimator
of $\sigma^2$ is obtainable by dividing the minimum sum of squares by the number of
degrees of freedom, N - 2. That is,

$\hat\sigma^2 = \dfrac{1}{N - 2} \sum_i \dfrac{(y_i - \hat\beta x_i)^2}{\lambda + \hat\beta^2} = \dfrac{1}{(N - 2)\hat\beta} \sum_i x_i(\hat\beta x_i - y_i)$

by Eq. (21). Therefore

$\hat\sigma^2 = \dfrac{N - 1}{(N - 2)\lambda}(s_Y^2 - \hat\beta s_{XY})$,  by Eq. (22)

A good discussion of the functional relation between variables subject to error
may be found in reference [2].
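The estimators of this section can be illustrated with a short sketch. It evaluates the slope from Eq. (11.9.17) and then forms the consistent variance estimate in two ways, once directly from the weighted sum of squares and once from the closed form that follows from Eq. (11.9.22). The data are those of Example 1, and the ratio `lam = 0.01` is a purely hypothetical value chosen for the illustration, since the text gives no error-variance ratio for those data.

```python
from math import sqrt

def sample_moments(xs, ys):
    """Return n, the two means, and s_X^2, s_Y^2, s_XY (divisor n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    syy = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    return n, xbar, ybar, sxx, syy, sxy

def functional_slope(sxx, syy, sxy, lam):
    """Slope estimator of Eq. (11.9.17); requires sxy != 0."""
    d = syy - lam * sxx
    return (d + sqrt(d * d + 4.0 * lam * sxy * sxy)) / (2.0 * sxy)

X = [47, 50, 52, 52, 54, 56, 58, 59, 60, 60, 62, 64, 65, 66]
Y = [2.5, 2.7, 2.8, 2.8, 2.9, 3.2, 3.2, 3.3, 3.4, 3.5, 3.5, 3.6, 3.7, 3.8]
lam = 0.01   # hypothetical ratio of the error variance of Y to that of X

n, xbar, ybar, sxx, syy, sxy = sample_moments(X, Y)
beta = functional_slope(sxx, syy, sxy, lam)
alpha = ybar - beta * xbar                     # Eq. (11.9.19)

# Left-hand side of the quadratic (11.9.16); it should vanish at beta.
quadratic = sxy * beta ** 2 + (lam * sxx - syy) * beta - lam * sxy

# Consistent estimate of sigma^2: weighted sum of squares over N - 2 ...
weighted_ss = sum((y - alpha - beta * x) ** 2 for x, y in zip(X, Y)) / (lam + beta ** 2)
sigma2_direct = weighted_ss / (n - 2)
# ... and the equivalent closed form obtained with Eq. (11.9.22).
sigma2_closed = (n - 1) / ((n - 2) * lam) * (syy - beta * sxy)

print(beta, quadratic)
print(sigma2_direct, sigma2_closed)
```

As $\lambda$ is made very large the slope approaches the ordinary least-squares value $s_{XY}/s_X^2$, which corresponds to treating X as free from error.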

* 11.10  The Regression Relation Between Variables Subject to Error  If Y is a
random variable, subject also to error, and if X is "fixed," that is, pre-selected,
the relations corresponding to Eqs. (11.9.1) and (11.9.2) are

(11.10.1)  $X = \xi + u$
           $Y = \eta + \varepsilon + v$
           $\eta = \alpha + \beta\xi$

If $(x_i, y_i)$ are the observed pairs of values,

(11.10.2)  $x_i = \xi_i + u_i$
           $y_i = \eta_i + \varepsilon_i + v_i$
           $\eta_i = \alpha + \beta\xi_i$

where successive observations are independent, the $u_i$ and $v_i$ are uncorrelated
with each other and with the $\varepsilon_i$, and where $u_i$, $v_i$ and $\varepsilon_i$ have zero expectations
and variances $\sigma_u^2$, $\sigma_v^2$, $\sigma_\varepsilon^2$, for all i. Then

(11.10.3)  $\sigma_X^2 = \sigma_u^2$,  $\sigma_Y^2 = \sigma_\varepsilon^2 + \sigma_v^2$

The analysis of § 11.9 then holds with $\sigma_Y^2$ in place of $\sigma_v^2$. The only difference
is that the variance of Y is now partly due to its inherent nature as a random
variable and partly due to experimental error.

If we are interested in estimating the regression of Y on the observed X rather
than on the true $\xi$, we simply treat the X as being without error. That is, we use
the technique of § 11.7, and our estimator of $\beta$ is the ordinary one obtained in
§ 11.3, namely,

(11.10.4)  $\hat\beta = b = \dfrac{s_{XY}}{s_X^2}$

This is obtained as a special case of (11.9.16) when $\sigma_u^2 = 0$ and therefore
$\lambda \to \infty$.

If both X and Y are random variables (the structural situation mentioned in
§ 11.3), the equations (1) still hold, but $\xi$ is now a random variable with
expectation $\mu$ and variance $\sigma_\xi^2$. Then $\eta$ is also a random variable with
expectation $\alpha + \beta\mu$ and variance $\beta^2\sigma_\xi^2$. On the assumption that u, v and $\varepsilon$ are
uncorrelated with each other and with $\xi$, we have

(11.10.5)  $\sigma_X^2 = \sigma_\xi^2 + \sigma_u^2$,  $\sigma_Y^2 = \beta^2\sigma_\xi^2 + \sigma_\varepsilon^2 + \sigma_v^2$,  $\sigma_{XY} = \beta\sigma_\xi^2$

The sum $\sigma_\varepsilon^2 + \sigma_v^2$ may be denoted by $\sigma_v^2$, as the two components cannot be
distinguished. The maximum likelihood treatment of this case is discussed in
reference [3]. Estimators of $\mu$, $\alpha$ and $\beta$ are connected by the relations

(11.10.6)  $\hat\mu = \bar{x}$,  $\hat\alpha + \hat\beta\bar{x} = \bar{y}$

but in order to find $\hat\beta$ it is necessary to assume that either $\sigma_u^2$, $\sigma_v^2$ or the ratio
$\lambda = \sigma_v^2/\sigma_u^2$ is known. If $\lambda$ is known, $\hat\beta$ is given by

(11.10.7)  $\hat\beta = t + (t^2 + \lambda)^{1/2}$,  $t = \dfrac{s_Y^2 - \lambda s_X^2}{2s_{XY}}$

This is the same as Eq. (11.9.17). An estimator of $\sigma_\xi^2$ is given by

(11.10.8)  $\hat\beta\hat\sigma_\xi^2 = s_{XY} = \dfrac{1}{N - 1}\sum_i (x_i - \bar{x})(y_i - \bar{y})$

11.11 The Method of Grouping  A very simple method of fitting a straight
line, when X and Y are functionally related but subject to error, has been
suggested by several writers, notably A. Wald and M. S. Bartlett (see references
[4] and [5]).

The observed N pairs are ordered, usually by reference to the X values, and
divided into three groups. A number p is chosen ($p \le \tfrac12$ and such that Np is
integral) and then the first Np observations are put in group $G_1$, the last Np in
group $G_3$, and the remainder in group $G_2$. Wald suggested taking p as near as
possible to $\tfrac12$, so that $G_2$ was either empty or contained one observation. Bartlett
suggested taking p approximately $\tfrac13$, which in general gives greater accuracy. The
exact value is not very important, but studies [5] indicate that the numbers in
the three groups $G_1$, $G_2$ and $G_3$ should be nearly in the ratio 1:2:1 for maximum
efficiency.

The method consists in plotting the points A and B, which are the centroids
of the two groups of points $G_1$ and $G_3$, and joining AB. The coordinates of A
and B are the group means $(\bar x_1, \bar y_1)$ and $(\bar x_3, \bar y_3)$. The slope of AB is an estimator
of $\beta$. The line parallel to AB through the over-all mean $(\bar x, \bar y)$, at C, is the line
required, with equation

(11.11.1)  $y - \bar y = \hat\beta(x - \bar x)$

where

(11.11.2)  $\hat\beta = \dfrac{\bar y_3 - \bar y_1}{\bar x_3 - \bar x_1}$

(see Figure 52).

This quantity $\hat\beta$ is a consistent estimator of $\beta$ if two conditions are satisfied:
(1) the grouping should be independent of the errors of observation, (2) the
quantity $\bar x_3 - \bar x_1$ should not approach zero as $N \to \infty$. The second condition
is obviously satisfied if the observations are ordered according to their increasing
true values $\xi$. Unfortunately we do not know the true values and the order of
the observed values x may not be independent of the errors.
The precise conditions for the consistency of $\hat\beta$ are difficult to satisfy,
particularly if we assume that the error in X is normally distributed. Theoretically
an error of any magnitude whatever is possible, since the range of a normal
variate is infinite. Practically the probability of an error numerically greater
than $\delta = 4\sigma_u$ is negligible. If the values of $\xi$ corresponding to cumulative
probabilities p and 1 − p are denoted by $\xi_p$ and $\xi_{1-p}$, the grouping by observed
values will be practically the same as grouping by true values, provided that
scarcely any observed values fall in the intervals

$[\xi_p - \delta,\ \xi_p + \delta]$  and  $[\xi_{1-p} - \delta,\ \xi_{1-p} + \delta]$

Fig. 52  Fitting a straight line by the method of grouping

As in § 11.9, we assume that the errors $u = X - \xi$ are independent and
normal with common variance $\sigma_u^2$, the errors $v = Y - \eta$ are independent and
normal with common variance $\sigma_v^2$, and u and v are uncorrelated with each other.
Then it is possible to determine confidence limits for $\beta$.

From Eq. (2), we have

(11.11.3)  $(\bar x_3 - \bar x_1)(\hat\beta - \beta) = \bar y_3 - \bar y_1 - \beta(\bar x_3 - \bar x_1) = (\bar v_3 - \beta\bar u_3) - (\bar v_1 - \beta\bar u_1)$

since $y_i = \eta_i + v_i = \alpha + \beta\xi_i + v_i$ and $x_i = \xi_i + u_i$. The variance of $\bar v_3$
or $\bar v_1$ is $\sigma_v^2/k$, where k = Np, and that of $\bar u_3$ or $\bar u_1$ is $\sigma_u^2/k$, so that the variance of
the right-hand side of Eq. (3) is $(\sigma_v^2 + \beta^2\sigma_u^2)(2/k)$.

For the points in group $G_1$, $y_i - \bar y_1 - \beta(x_i - \bar x_1) = v_i - \bar v_1 - \beta(u_i - \bar u_1)$,
so that

(11.11.4)  $\sum [y_i - \bar y_1 - \beta(x_i - \bar x_1)]^2 = \sum (v_i - \bar v_1)^2 + \beta^2 \sum (u_i - \bar u_1)^2 - 2\beta \sum (v_i - \bar v_1)(u_i - \bar u_1)$

and the left-hand side is therefore an estimator of $(k - 1)(\sigma_v^2 + \beta^2\sigma_u^2)$. The
corresponding sums over $G_2$ and $G_3$ are estimators of $(N - 2k - 1)(\sigma_v^2 + \beta^2\sigma_u^2)$
and $(k - 1)(\sigma_v^2 + \beta^2\sigma_u^2)$ respectively. The three sums combined give an
estimator of $(N - 3)(\sigma_v^2 + \beta^2\sigma_u^2)$.

If we write

(11.11.5)  $(N - 3)s_x^2 = \sum_{G_1} (x_i - \bar x_1)^2 + \sum_{G_2} (x_i - \bar x_2)^2 + \sum_{G_3} (x_i - \bar x_3)^2$

and similar expressions for $(N - 3)s_y^2$ and $(N - 3)s_{xy}$, the estimator of
$\sigma_v^2 + \beta^2\sigma_u^2$ is $s_y^2 - 2\beta s_{xy} + \beta^2 s_x^2 = S(\beta)$, say.

Since on our assumptions (see Eq. (3) above) the statistic $(\bar x_3 - \bar x_1)(\hat\beta - \beta)$
is normally distributed about zero, and since an independent estimator of its
variance with N − 3 d.f. is $2S(\beta)/k$, the quantity

(11.11.6)  $t = \dfrac{(\bar x_3 - \bar x_1)(\hat\beta - \beta)}{[2S(\beta)/k]^{1/2}}$

has the Student-t distribution with N − 3 degrees of freedom. If $t_\alpha$ is the value
of t exceeded in absolute value with probability $\alpha$, the quadratic equation

(11.11.7)  $\dfrac{2t_\alpha^2}{k}\,[s_y^2 - 2\beta s_{xy} + \beta^2 s_x^2] = (\bar x_3 - \bar x_1)^2(\hat\beta - \beta)^2$

where $\hat\beta$ is given by Eq. (2), provides 100(1 − α)% confidence limits for $\beta$. If
$\beta_u$ is the upper 95% limit, a rough estimate of the standard deviation of $\hat\beta$ is
given by $(\beta_u - \hat\beta)/t_{0.05}$.

Example 5  The friction (ounces) in a simple machine is y when the load is
x ounces.

x |  23.4   44.7   65.4   86.8   107.5   128.8   149.6   171.0
y |   3.4    4.7    5.4    6.8     7.5     8.8     9.6    11.0

Taking the first two and the last two measurements, we have $\bar x_1 = 34.05$,
$\bar x_3 = 160.3$, $\bar y_1 = 4.05$, $\bar y_3 = 10.3$; also $\bar x = 97.15$, $\bar y = 7.15$. Therefore
$\hat\beta = 0.0495$, and the line fitted is y = 0.0495x + 2.34. We also find from
Eq. (5) that $5s_x^2 = 226.8 + 2224.0 + 229.0 = 2679.8$, $5s_y^2 = 7.8525$,
$5s_{xy} = 143.85$, so that $S(\hat\beta) = 1.5705 - 57.54\hat\beta + 535.96\hat\beta^2 = 0.0354$.

The 95% confidence limits for $\beta$ are given by Eq. (7) with k = 2, $t_\alpha = 2.571$.
This equation reduces to $\beta^2 - 0.09661\beta + 0.002313 = 0$, the roots of which
are $\beta = 0.0437$ and 0.0529.

If we use Wald's method with two groups of four observations each, the
central group $G_2$ is empty and N − 3 in Eq. (5) must be replaced by N − 2.
We find $\bar x_1 = 55.075$, $\bar x_3 = 139.225$, $\bar y_1 = 5.075$, $\bar y_3 = 9.225$, with $\bar x$ and $\bar y$ as
before. The line fitted is y = 0.0493x + 2.36. Also $6s_x^2 = 4456.5$, $6s_y^2 = 12.48$,
$6s_{xy} = 234.48$, so that $S(\hat\beta) = 2.08 - 78.16\hat\beta + 742.75\hat\beta^2 = 0.0321$ with
$\hat\beta = 0.0493$.

The 95% limits given by Eq. (7), with k = 4, are 0.0430 and 0.0526; this
method, therefore, in spite of using all the observations, gives a slightly less
reliable value of the slope than the method which uses only the first two and
last two observations.
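
The arithmetic of Example 5 can be retraced with a short program. This is only a sketch under stated assumptions: the function names are hypothetical, the sixth friction value (8.8) is the one implied by the group means quoted in the example, and the confidence limits are obtained by solving the quadratic of Eq. (11.11.7) directly, so the last digit may differ slightly from the rounded hand computation.

```python
import math

def mean(values):
    return sum(values) / len(values)

def grouping_fit(x, y, k, t_alpha):
    """Fit by the method of grouping with k points in each outer group."""
    pairs = sorted(zip(x, y))                      # order by observed x
    n = len(pairs)
    groups = [g for g in (pairs[:k], pairs[k:n - k], pairs[n - k:]) if g]
    x1, y1 = mean([p[0] for p in groups[0]]), mean([p[1] for p in groups[0]])
    x3, y3 = mean([p[0] for p in groups[-1]]), mean([p[1] for p in groups[-1]])
    beta_hat = (y3 - y1) / (x3 - x1)
    alpha_hat = mean(y) - beta_hat * mean(x)

    # Pooled within-group sums of squares and products, Eq. (11.11.5);
    # the divisor is N minus the number of non-empty groups.
    sxx = syy = sxy = 0.0
    for g in groups:
        gx, gy = mean([p[0] for p in g]), mean([p[1] for p in g])
        sxx += sum((p[0] - gx) ** 2 for p in g)
        syy += sum((p[1] - gy) ** 2 for p in g)
        sxy += sum((p[0] - gx) * (p[1] - gy) for p in g)
    df = n - len(groups)
    sxx, syy, sxy = sxx / df, syy / df, sxy / df

    # Eq. (11.11.7) rearranged as A*b^2 + B*b + C = 0.
    c = 2.0 * t_alpha ** 2 / k
    d2 = (x3 - x1) ** 2
    A = c * sxx - d2
    B = -2.0 * c * sxy + 2.0 * d2 * beta_hat
    C = c * syy - d2 * beta_hat ** 2
    disc = math.sqrt(B * B - 4.0 * A * C)          # fails if no real limits exist
    limits = sorted([(-B - disc) / (2.0 * A), (-B + disc) / (2.0 * A)])
    return beta_hat, alpha_hat, limits

load = [23.4, 44.7, 65.4, 86.8, 107.5, 128.8, 149.6, 171.0]
friction = [3.4, 4.7, 5.4, 6.8, 7.5, 8.8, 9.6, 11.0]  # 8.8 inferred from the stated means

bartlett = grouping_fit(load, friction, k=2, t_alpha=2.571)  # 5 d.f.
wald = grouping_fit(load, friction, k=4, t_alpha=2.447)      # 6 d.f.
```

Both calls should give slopes near 0.0495 and 0.0493 and limits close to those quoted in the example; the interval from the two-group version comes out slightly wider.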
11.12 The Distribution of the Pearson Correlation Coefficient  We have seen
that if the parent population is uncorrelated (i.e., $\rho = 0$) and if X and Y are
normally distributed, the quantity

(11.12.1)  $t = r(N - 2)^{1/2}(1 - r^2)^{-1/2}$

has the Student-t distribution with N − 2 d.f. If $u = r^2/(1 - r^2) = t^2/(N - 2)$,
it follows from the result of § 8.5 that u is distributed like the ratio of two
independent $\chi^2$ variates with 1 and N − 2 d.f. respectively. This in turn means
(see § 4.5) that u is a beta-prime variate with parameters $\tfrac12$ and (N − 2)/2. Its
density function is therefore

(11.12.2)  $f(u) = u^{-1/2}(1 + u)^{-(N-1)/2} \Big/ B\!\left(\tfrac12, \tfrac{N-2}{2}\right)$

The density function for r is obtained by putting 2g(r) dr = f(u) du. The
factor 2 arises because, as r goes from −1 to 1, u goes from +∞ to 0 and back
to +∞. Since $du/dr = 2r(1 - r^2)^{-2}$, we have

(11.12.3)  $g(r) = (1 - r^2)^{(N-4)/2} \Big/ B\!\left(\tfrac12, \tfrac{N-2}{2}\right)$

The graph of g(r) is a symmetrical bell-shaped curve for N > 5. Because of
the symmetry, E(r) = 0, and it is easily proved that

(11.12.4)  $E(r^2) = (N - 1)^{-1}$

The standard deviation of r is therefore $(N - 1)^{-1/2}$.

The kurtosis is $-6/(N + 1)$, which tends to zero as N increases. For very
large N the distribution is approximately normal.

It  is  not  necessary  to  assume  a  bivariate  normal  distribution.  Provided  that 
X  and  Y  are  independent  and  that  at  least  one  is  a  random  sample  from  a 
univariate  normal  distribution,  the  distribution  of  Eq.  (3)  holds. 
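
The null density of r and its second moment can be confirmed by numerical integration. The sketch below is illustrative only; the function names, the sample size N = 10 and the simple midpoint rule are choices made here, not part of the text.

```python
import math

def beta_function(a, b):
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def t_statistic(r, n):
    """Eq. (11.12.1): Student t with n - 2 d.f. when rho = 0."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def null_density(r, n):
    """Eq. (11.12.3): density of r when rho = 0."""
    return (1.0 - r * r) ** ((n - 4) / 2.0) / beta_function(0.5, (n - 2) / 2.0)

def integrate(func, a, b, steps=20000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

N = 10
total = integrate(lambda r: null_density(r, N), -1.0, 1.0)
second_moment = integrate(lambda r: r * r * null_density(r, N), -1.0, 1.0)
```

Here `total` should be very close to 1 and `second_moment` very close to 1/(N − 1), in agreement with Eq. (11.12.4).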

When  p  is  not  zero,  the  exact  distribution  of  r  is  quite  complicated.  It  was 
first  found  by  Fisher,  using  an  essentially  geometrical  argument.  An  analytical 
treatment  may  be  found  in  [7],  and  a  full  discussion  by  Hotelling  [8]  is  probably 
the  last  word  for  some  time  on  this  subject.  The  density  function  for  r  is 

(11.12.5)  $f(r, \rho) = \dfrac{N - 2}{\pi}\,(1 - \rho^2)^{(N-1)/2}(1 - r^2)^{(N-4)/2}\, I(\rho r)$

where $I(\rho r)$ is given by

(11.12.6)  $I(\rho r) = \displaystyle\int_0^\infty \frac{du}{(\cosh u - \rho r)^{N-1}}$

The integral can be expressed as a rapidly convergent series, and we obtain

(11.12.7)  $f(r, \rho) = \dfrac{(N - 2)\,\Gamma(N - 1)\,(1 - \rho^2)^{(N-1)/2}(1 - r^2)^{(N-4)/2}}{(2\pi)^{1/2}\,\Gamma(N - \tfrac12)\,(1 - \rho r)^{N - 3/2}}\; S(\rho r)$

where

(11.12.8)  $S(\rho r) = 1 + \dfrac{1}{4}\,\dfrac{\rho r + 1}{2N - 1} + \dfrac{9}{32}\,\dfrac{(\rho r + 1)^2}{(2N - 1)(2N + 1)} + O\!\left(\dfrac{1}{N^3}\right)$


Tables  of  the  function  /(r,  p)  and  of  its  integral  have  been  prepared  by 
Miss  F.  N.  David  [9],  who  included  also  several  charts  from  which  approximate 
confidence  limits  for  p  may  readily  be  obtained. 

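The distribution of r is far from normal, particularly when $\rho$ is near +1 or
−1. Series expressions may be found for the cumulants of this distribution, as
follows (with n written for N − 1):

(11.12.9)
$\kappa_1 = \rho - \dfrac{\rho(1 - \rho^2)}{2n} + \cdots$

$\kappa_2 = \dfrac{(1 - \rho^2)^2}{n} + \cdots$

$\gamma_1 = \dfrac{\kappa_3}{\kappa_2^{3/2}} = \dfrac{-6\rho}{n^{1/2}}\left[1 + \dfrac{77\rho^2 - 30}{12n} + \cdots\right]$

$\gamma_2 = \dfrac{\kappa_4}{\kappa_2^{2}} = \dfrac{6(12\rho^2 - 1)}{n} + \cdots$

Thus if $\rho = 0.8$ and N = 50 we find that $\gamma_1 = -0.71$ and $\gamma_2 = 0.82$.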

11.13 Fisher's Transformation  Fisher showed that if we transform to a new
variable z' by the relation

(11.13.1)  $z' = \tanh^{-1} r = \tfrac12 \log_e \dfrac{1 + r}{1 - r}$

then z' is approximately normally distributed with variance 1/(N − 3), whatever
the value of $\rho$. This remark enables us to assess readily the significance of an
observed value of r, without having to use David's tables.

If $\zeta = \tanh^{-1} \rho$, it may be proved that:

(11.13.2)  $E(z') = \zeta + \dfrac{\rho}{2n} + O\!\left(\dfrac{1}{n^2}\right)$

where n = N − 1. Also


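(11.13.3)  $V(z') = \dfrac{1}{n} + \dfrac{4 - \rho^2}{2n^2} + O\!\left(\dfrac{1}{n^3}\right)$

(11.13.4)  $\gamma_1(z') = \dfrac{\rho^3}{n^{3/2}} + \cdots$

(11.13.5)  $\gamma_2(z') = \dfrac{2}{n} + \dfrac{4 + 2\rho^2 - 3\rho^4}{n^2} + \cdots$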
For $\rho = 0.8$ and N = 50, $\gamma_1(z') = 0.0015$ and $\gamma_2(z') = 0.042$. This shows
clearly, when compared with the values given at the end of § 11.12, how much
more nearly normal z' is than r.

From Eq. (3), the variance depends to some extent on $\rho$. For $\rho = 0$ it is
approximately 1/(n − 2) and for $\rho = \pm 1$ approximately 1/(n − 3/2). The
Fisher value 1/(n − 2) is therefore reasonably close for any $\rho$.
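
The comparison between r and z' can be reproduced from the leading terms of the series. The sketch below assumes the usual first-order expansions (with n = N − 1); the function names are hypothetical, and only the terms written in the code are used, so the results are approximations of the quoted figures rather than exact values.

```python
import math

def skew_kurt_r(rho, N):
    """Leading terms of gamma_1 and gamma_2 for r (assumed series)."""
    n = N - 1
    g1 = -6.0 * rho / math.sqrt(n) * (1.0 + (77.0 * rho ** 2 - 30.0) / (12.0 * n))
    g2 = 6.0 * (12.0 * rho ** 2 - 1.0) / n
    return g1, g2

def skew_kurt_z(rho, N):
    """Leading terms of gamma_1 and gamma_2 for z' (assumed series)."""
    n = N - 1
    g1 = rho ** 3 / n ** 1.5
    g2 = 2.0 / n + (4.0 + 2.0 * rho ** 2 - 3.0 * rho ** 4) / n ** 2
    return g1, g2

def variance_z(rho, N):
    """Variance of z' to order 1/n^2."""
    n = N - 1
    return 1.0 / n + (4.0 - rho ** 2) / (2.0 * n ** 2)

g1_r, g2_r = skew_kurt_r(0.8, 50)
g1_z, g2_z = skew_kurt_z(0.8, 50)
fisher_variance = 1.0 / (50 - 3)
```

With $\rho = 0.8$ and N = 50 the skewness of z' is smaller than that of r by a factor of several hundred, and `variance_z` stays within a few parts in a thousand of `fisher_variance` over the whole range of $\rho$.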

Example 6  For 20 students the correlation coefficient between scores on
two tests was 0.65. What are the 95% confidence limits for $\rho$?

Assuming a normal distribution for z', the 95% limits will be $z' \pm 1.96/\sqrt{17}$.
Since the expectation of z' is $\zeta + \rho/38$, the limits for $\zeta$ will be
$z' - \rho/38 \pm 1.96/\sqrt{17}$. Now $z' = \tanh^{-1} 0.65 = 0.775$, but $\rho$ is unknown. However, since
$\rho/38$ is small, we can substitute for $\rho$ the sample value 0.65, and this gives
$\zeta \approx 0.775 - 0.017 \pm 0.475 = 0.283$ to 1.233. The corresponding limits for
$\rho$ (= tanh $\zeta$) are 0.276 to 0.844. Direct reading from the chart in David's tables
gives 0.28 and 0.84. Appendix B.11 gives a table of r = tanh z'.
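
The steps of Example 6 translate directly into a few lines of code. This is a minimal sketch; the function name and its keyword arguments are hypothetical, and the bias term uses the sample r in place of the unknown $\rho$, as in the example.

```python
import math

def fisher_limits(r, N, z_crit=1.96, correct_bias=True):
    """Approximate confidence limits for rho from a sample correlation r."""
    z = math.atanh(r)
    if correct_bias:
        # E(z') exceeds zeta by about rho / (2n), n = N - 1; r stands in for rho.
        z -= r / (2.0 * (N - 1))
    half_width = z_crit / math.sqrt(N - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

lower, upper = fisher_limits(0.65, 20)
```

For r = 0.65 and N = 20 the limits agree with the 0.276 and 0.844 of the example to about three decimal places.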

The Fisher transformation is another example of the transformations of
variables considered in Chapter 4 (§ 4.3). It achieves at the same time
approximate constancy of variance and approximate normality. It is convenient in
taking the average of the correlations obtained from several samples, supposedly
from the same population. The values of r are transformed to values of z' and
each is given the weight N − 3, inversely proportional to its variance. The
weighted mean of the z' is then transformed back to r.

Example 7  Separate samples of sizes 50, 70 and 100 give correlation
coefficients 0.72, 0.68 and 0.77. What is the best value to use as an average?
The z' are 0.908, 0.829, and 1.020. The weighted mean is

$\bar z' = \tfrac{1}{211}\,[47(0.908) + 67(0.829) + 97(1.020)] = 0.934$

The corresponding r is 0.733.
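
The weighted average of Example 7 may be sketched as follows. The function name is hypothetical; the weights N − 3 are those prescribed in the text.

```python
import math

def pooled_correlation(rs, sizes):
    """Average several correlations through z' with weights N - 3."""
    weights = [n - 3 for n in sizes]
    z_bar = sum(w * math.atanh(r) for w, r in zip(weights, rs)) / sum(weights)
    return z_bar, math.tanh(z_bar)

z_bar, r_bar = pooled_correlation([0.72, 0.68, 0.77], [50, 70, 100])
```

The values of `z_bar` and `r_bar` should agree with 0.934 and 0.733 to within rounding.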

11.14  Rank  Correlation  It  is  sometimes  possible  to  place  a  group  of 
individuals  in  order  with  respect  to  some  characteristic  without  having  to 
measure  this  characteristic  numerically  for  each  one.  For  instance,  a  judge  in  a 
contest  may  have  to  rank  a  group  of  young  ladies  for  beauty  or  a  sales-manager 
may  rank  a  group  of  salesmen  for  keenness  or  efficiency.  One  of  the  classic 
methods  of  estimating  hardness  for  minerals  was  by  a  rank  order — mineral  A 
was  said  to  be  harder  than  B  if  a  piece  of  A  would  scratch  a  piece  of  B  when  the 
two  were  rubbed  together.  If  A  scratched  C  but  C  scratched  B,  then  C  was 
intermediate  in  hardness  between  A  and  B.  A  simple  scale  could  be  established 
by  taking  some  standard  minerals  and  labelling  them  1,  2,  3,  and  so  on,  in  the 
proper  order,  but  this  does  not  give  any  true  measurement  of  relative  hardness. 

If the same individuals are ranked in two ways, say by different judges or
according to different criteria, the degree of concordance between the two
rankings may be of interest. The coefficient of rank correlation is intended to
measure this concordance. If the two judges agree perfectly in their rankings,
so that the ith individual in one ranking is also ith in the other ranking (i = 1,
2, ..., N), we should expect the coefficient of correlation to be 1; if the judges are
diametrically opposite, so that the ith individual in one ranking is the (N − i + 1)th
in the other, the coefficient of correlation should be −1. If the ranks are
assigned by pure chance we should expect a value of the correlation coefficient
near zero. The two principal methods of calculating rank correlation agree in
these extreme cases, but differ in the values assigned to intermediate degrees of
concordance.

Spearman's coefficient $r_S$ depends on the differences $d_i = x_i - y_i$ between
the ranks $x_i$ and $y_i$ of the same individual on the two rankings. In fact,

(11.14.1)  $r_S = 1 - 6\sum d_i^2 \big/ [N(N^2 - 1)]$

Example 8  Suppose that seven bathing beauties, labelled A to G, were
ranked by two judges as in the following table:

Table 11.3

Contestant           A   B   C   D   E   F   G
(x) Judge 1          2   1   4   5   3   7   6
(y) Judge 2          3   4   2   5   1   6   7
d^2 = (x - y)^2      1   9   4   0   4   1   1

The differences of the ranks are squared in the last row. Here N = 7 and
$\sum d_i^2 = 20$, so that

$r_S = 1 - \dfrac{120}{336} = 0.64$

This suggests that the judges agree reasonably well, but the question of
significance will be taken up later.
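
Spearman's coefficient for the two judges of Example 8 can be computed as follows. This is a small sketch assuming untied ranks; the function name is hypothetical.

```python
def spearman_rs(x_ranks, y_ranks):
    """Spearman's coefficient for untied ranks, Eq. (11.14.1)."""
    n = len(x_ranks)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
    return sum_d2, 1.0 - 6.0 * sum_d2 / (n * (n * n - 1))

judge1 = [2, 1, 4, 5, 3, 7, 6]   # contestants A to G
judge2 = [3, 4, 2, 5, 1, 6, 7]
sum_d2, r_s = spearman_rs(judge1, judge2)
```

The sum of squared differences is 20 and the coefficient is 1 − 120/336, as in the example.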

Kendall's method [6] depends on giving each possible pair of individuals in
the sample a score of +1 or −1 according as their ranks are in the same order
or in the opposite order on the two rankings. Thus, in Example 8 above, A ranks
above D according to both judges, and so the pair AD gets a score +1. On the
other hand A is below B on one ranking but above on the other, so the pair AB
gets a score −1. The total number of pairs is $\binom{N}{2} = N(N - 1)/2$. If S is the
total score,

(11.14.2)  $r_K = S \Big/ \binom{N}{2}$

It is not necessary to consider every one of the pairs in this way. The same
result is obtained by writing the $x_i$ in their natural order and then for each $y_i$
counting how many numbers there are to the right of this $y_i$ and greater than it.
Let this number be $n_i$ and let $P = \sum_{i=1}^{N} n_i$. Then

(11.14.3)  $r_K = \dfrac{4P}{N(N - 1)} - 1$

Rewriting the above table, we have

Table 11.4

Contestant   B   A   E   C   D   G   F
x_i          1   2   3   4   5   6   7
y_i          4   3   1   2   5   7   6
n_i          3   3   4   3   2   0   0

so that P = 15 and $r_K = 60/42 - 1 = 0.43$. (The number 4 for $y_i$ in the column
headed B has three numbers to the right which are greater than 4. This gives
$n_1 = 3$.)

The reason that Eq. (3) gives the same result as Eq. (2) is that only pairs with
their $y_i$ increasing from left to right will give a positive contribution to the score
(since the $x_i$ always increase from left to right). Then P is the total positive
contribution to S. If the total negative contribution is −Q, P + Q =
N(N − 1)/2, so that S, which is defined as P − Q, is 2P − N(N − 1)/2. On
substituting this value of S in Eq. (2) we arrive at Eq. (3). The modification of
this procedure due to ties in the ranking will be considered in § 11.16.
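
The agreement between the pair-by-pair score S and the shortcut count P can be checked for the ranks of Example 8. The sketch below assumes no ties; the function names are hypothetical.

```python
from itertools import combinations

def sign(v):
    return (v > 0) - (v < 0)

def kendall_score(x, y):
    """Total score S: +1 for each concordant pair, -1 for each discordant pair."""
    return sum(sign((x[i] - x[j]) * (y[i] - y[j]))
               for i, j in combinations(range(len(x)), 2))

def kendall_p(x, y):
    """P of Eq. (11.14.3): for each y, count the larger y values to its right
    once the items are written in the natural order of x (no ties assumed)."""
    ordered_y = [yi for _, yi in sorted(zip(x, y))]
    return sum(sum(1 for later in ordered_y[i + 1:] if later > yi)
               for i, yi in enumerate(ordered_y))

judge1 = [2, 1, 4, 5, 3, 7, 6]   # contestants A to G
judge2 = [3, 4, 2, 5, 1, 6, 7]
N = len(judge1)
S = kendall_score(judge1, judge2)
P = kendall_p(judge1, judge2)
r_k_from_s = 2.0 * S / (N * (N - 1))        # Eq. (11.14.2)
r_k_from_p = 4.0 * P / (N * (N - 1)) - 1.0  # Eq. (11.14.3)
```

Both routes give the same coefficient, since S = 2P − N(N − 1)/2.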

* 11.15 Relation Between Rank Correlation and Pearson Correlation  Suppose
we have N individuals, or sample items, which are numbered for identification
and are given ratings on two attributes X and Y. These ratings may be ranks or
numerical values, and for the ith item will be denoted by $x_i$, $y_i$. For any pair of
items, numbered i and j, we suppose that a score $a_{ij}$ is allotted on attribute X
and a score $b_{ij}$ on Y. These scores will naturally depend on the values $x_i$ and $x_j$ or
$y_i$ and $y_j$, but we merely require that $a_{ij} = -a_{ji}$ and therefore that $a_{ij} = 0$
when i = j, with a similar condition on the $b_{ij}$. We can then define a generalized
correlation coefficient $r_G$ by the equation

(11.15.1)  $r_G = \dfrac{\sum (a_{ij} b_{ij})}{\left(\sum a_{ij}^2 \sum b_{ij}^2\right)^{1/2}}$

the sums being over all values of i and j from 1 to N (i ≠ j).

If we define $a_{ij}$ as +1 when the X-rank of the ith item exceeds that of the jth
item and −1 in the contrary case, then $a_{ij}^2$ will always be 1 for i ≠ j, and
$\sum a_{ij}^2$ will be N(N − 1), the total number of ordered pairs. The same holds for
$\sum b_{ij}^2$. But $\sum (a_{ij} b_{ij})$ is twice S, since each pair is counted twice, once in the
order ij and once in the order ji. (Each term in the sum is +1 if $a_{ij}$ and $b_{ij}$ are
both +1 or both −1, and is −1 if they have different signs.) It follows that
$r_G$ is equal to $r_K$ as defined by Eq. (11.14.2). If $x_i$ is the X-rank of item number
i and $y_i$ its Y-rank, and if in Eq. (1) we let

(11.15.2)  $a_{ij} = x_i - x_j, \qquad b_{ij} = y_i - y_j$

we arrive at the Spearman coefficient $r_S$. To show this, we note first that $x_i$ and $y_i$
both run through the set of integers from 1 to N, so that

(11.15.3)  $\sum x_i = \sum y_i = \tfrac12 N(N + 1)$

Now

$\sum_{i,j} (a_{ij} b_{ij}) = \sum_{i,j} (x_i - x_j)(y_i - y_j) = \sum_{i,j} x_i y_i + \sum_{i,j} x_j y_j - \sum_{i,j} (x_i y_j + x_j y_i)$

The first two terms on the right-hand side are each equal to $N \sum x_i y_i$. The third
term is the same as $-2(\sum_i x_i)(\sum_j y_j)$, since i and j can be interchanged.
Therefore, by Eq. (3),

(11.15.4)  $\sum (a_{ij} b_{ij}) = 2N \sum x_i y_i - 2\left[\dfrac{N(N + 1)}{2}\right]^2$

Putting $x_i = y_i$ we obtain

(11.15.5)  $\sum a_{ij}^2 = \sum b_{ij}^2 = 2N \sum x_i^2 - \dfrac{N^2(N + 1)^2}{2}$

Since $\sum x_i^2$ is the sum of the squares of the integers from 1 to N, which is
N(N + 1)(2N + 1)/6, the denominator of $r_G$ is

(11.15.6)  $\left(\sum a_{ij}^2 \sum b_{ij}^2\right)^{1/2} = \dfrac{N^2(N + 1)(2N + 1)}{3} - \dfrac{N^2(N + 1)^2}{2} = \dfrac{N^2(N^2 - 1)}{6}$

Now, if we write $d_i = x_i - y_i$,

(11.15.7)  $\sum_{i=1}^{N} d_i^2 = \sum (x_i^2 + y_i^2 - 2x_i y_i) = 2\sum x_i^2 - 2\sum x_i y_i = \dfrac{N(N + 1)(2N + 1)}{3} - 2\sum x_i y_i$

Substituting this expression in Eq. (4), we obtain

(11.15.8)  $\sum (a_{ij} b_{ij}) = \dfrac{N^2(N^2 - 1)}{6} - N \sum d_i^2$

Dividing by Eq. (6), we find that $r_G = r_S$.
Lastly, if the scores $a_{ij}$ and $b_{ij}$ are based on measured values $x_i$, $y_i$, and if
$a_{ij} = x_i - x_j$, $b_{ij} = y_i - y_j$, the numerator of $r_G$ is

(11.15.9)  $\sum (x_i - x_j)(y_i - y_j) = 2N \sum x_i y_i - 2 \sum x_i \sum y_i = 2N(N - 1)s_{XY}$

where $s_{XY}$ is the sample covariance of X and Y. In the same way, the
denominator of $r_G$ is equal to $2N(N - 1)s_X s_Y$, and therefore $r_G$ reduces to the ordinary
Pearson coefficient. Thus the Spearman coefficient is simply the Pearson
coefficient calculated as if the ranks were the actual variates.
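
The three special cases of the generalized coefficient can be verified with one routine and two scoring rules. The following is a sketch only: the function names are hypothetical, the ranks are those of Example 8, and the measured values are invented for illustration.

```python
import math

def generalized_r(x, y, score):
    """Generalized coefficient r_G of Eq. (11.15.1) for a given pair score."""
    n = len(x)
    num = den_a = den_b = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                a, b = score(x[i], x[j]), score(y[i], y[j])
                num += a * b
                den_a += a * a
                den_b += b * b
    return num / math.sqrt(den_a * den_b)

def sign_score(u, v):        # leads to Kendall's coefficient
    return (u > v) - (u < v)

def difference_score(u, v):  # leads to Spearman (ranks) or Pearson (values)
    return u - v

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

judge1 = [2, 1, 4, 5, 3, 7, 6]
judge2 = [3, 4, 2, 5, 1, 6, 7]
r_kendall = generalized_r(judge1, judge2, sign_score)
r_spearman = generalized_r(judge1, judge2, difference_score)

x_vals = [1.2, 2.9, 3.1, 4.8, 6.0]   # hypothetical measurements
y_vals = [2.0, 2.5, 4.1, 3.9, 6.2]
r_pearson = generalized_r(x_vals, y_vals, difference_score)
```

With the sign score the routine returns 9/21, with rank differences 1 − 120/336, and with measured values the ordinary product-moment coefficient.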

11.16 Tied Ranks  In practice it is often difficult to distinguish between two
or more individuals with regard to the attribute considered, and they are
reckoned as "tied." In such cases the tied individuals are given a rank which is
the mean of the ranks they would have had if they had been distinguishable. If,
for example, the 3rd and 4th items are tied, they are both given the rank 3½. If
the 3rd, 4th and 5th are tied they are all given rank 4. This preserves the sum of
ranks but reduces somewhat the sum of squares, as compared with a ranking in
which there are no ties.

If in the X-ranking there are t items tied and in the Y-ranking u items tied, the
denominator of Kendall's coefficient in Eq. (11.14.2) is replaced by

(11.16.1)  $[\tfrac12 N(N - 1) - \tfrac12 t(t - 1)]^{1/2}\,[\tfrac12 N(N - 1) - \tfrac12 u(u - 1)]^{1/2}$

and if there are several sets of ties the appropriate amount is subtracted for each
such set. The numerator is calculated as before, all tied pairs contributing zero
to the total score S. Thus, suppose the ranks are as given in the following table:

Example 9

Table 11.5

x_i   1    2½   2½   4    5    7    7    7    9    10
y_i   2    1    4    4    7½   9    10   4    6    7½
n_i   8    7½   6    5½   2½   1    ½    2    1    0

There are two sets of ties, with t = 2 and t = 3, in the first ranking and two
sets, with u = 2 and u = 3, in the second ranking. The expression (1) is
$(45 - 1 - 3)^{1/2}(45 - 1 - 3)^{1/2} = 41$. In the shorter method of calculation
of S, any number to the right of $y_i$ and equal to it counts ½ towards $n_i$ and any
number (greater or smaller) counts ½ if its x rank is the same. Then P = 34 and
S = 23, giving $r_K = 23/41 = 0.56$.

In Spearman's method, the formula as corrected for one set of t ties in X and
one set of u ties in Y is

(11.16.2)  $r_S = \dfrac{N^3 - N - 6\sum d^2 - \tfrac12(t^3 - t) - \tfrac12(u^3 - u)}{[N^3 - N - (t^3 - t)]^{1/2}\,[N^3 - N - (u^3 - u)]^{1/2}}$

and each set of ties gives a separate correction.

In Example 9, $\sum d^2 = 49$, and $N^3 - N = 990$. Therefore, with t = 2 and
3, and u = 2 and 3,

$r_S = \dfrac{990 - 294 - 3 - 12 - 3 - 12}{990 - 6 - 24} = \dfrac{666}{960} = 0.69$

The uncorrected value is 0.70.
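
Both tie-corrected coefficients of Example 9 can be obtained by machine. The sketch below uses the ranks as read from Table 11.5 (they reproduce the quoted $\sum d^2 = 49$ and S = 23); the function names are hypothetical, and tie sets are found simply by counting repeated rank values.

```python
import math
from collections import Counter
from itertools import combinations

def sign(v):
    return (v > 0) - (v < 0)

def tie_sizes(ranks):
    """Sizes of the sets of tied items in one ranking."""
    return [c for c in Counter(ranks).values() if c > 1]

def kendall_tied(x, y):
    """Score S (tied pairs score zero) and r_K with the denominator (11.16.1)."""
    n = len(x)
    s = sum(sign(x[i] - x[j]) * sign(y[i] - y[j])
            for i, j in combinations(range(n), 2))
    pairs = n * (n - 1) / 2.0
    dx = pairs - sum(t * (t - 1) / 2.0 for t in tie_sizes(x))
    dy = pairs - sum(u * (u - 1) / 2.0 for u in tie_sizes(y))
    return s, s / math.sqrt(dx * dy)

def spearman_tied(x, y):
    """Spearman's coefficient corrected for ties, Eq. (11.16.2)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    base = n ** 3 - n
    tx = sum(t ** 3 - t for t in tie_sizes(x))
    ty = sum(u ** 3 - u for u in tie_sizes(y))
    corrected = (base - 6.0 * d2 - tx / 2.0 - ty / 2.0) / math.sqrt((base - tx) * (base - ty))
    uncorrected = 1.0 - 6.0 * d2 / base
    return d2, corrected, uncorrected

x_ranks = [1, 2.5, 2.5, 4, 5, 7, 7, 7, 9, 10]
y_ranks = [2, 1, 4, 4, 7.5, 9, 10, 4, 6, 7.5]
S, r_k = kendall_tied(x_ranks, y_ranks)
sum_d2, r_s, r_s_uncorrected = spearman_tied(x_ranks, y_ranks)
```

The corrected Spearman value is also what the product-moment formula gives when applied directly to the mid-ranks.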

11.17 Significance of the Rank Correlation Coefficient  The rank correlation
coefficient can be used as a test of the association between X and Y, and has the
advantage over the Pearson coefficient of not requiring the assumption that one
or both variables are normally distributed. If we suppose that in the parent
population X and Y are independent of each other, then if the N items in a
sample are placed in their natural order of ranking for X (i.e., 1, 2, 3, ..., N), the
order of ranking for Y is equally likely to be any one of the N! permutations of
the numbers 1 to N. (For the present, we are assuming that there are no ties.)
For small values of N it is possible to calculate for each of these permutations the
corresponding value of $r_K$ and so form a probability distribution. Thus if N = 5,
there are 120 possible rankings for Y, and these give the following possible values
of S and of $r_K$ (different rankings may correspond to the same value of S in
(11.14.2)).

Table 11.6

S      10     8     6     4     2     0    -2    -4    -6    -8   -10
r_K   1.0   0.8   0.6   0.4   0.2     0  -0.2  -0.4  -0.6  -0.8  -1.0
f       1     4     9    15    20    22    20    15     9     4     1

A frequency polygon drawn for the distribution of S lies fairly close to a normal
curve, and in fact it may be proved that, as N increases, this distribution tends to
normality, with variance N(N − 1)(2N + 5)/18. For N > 10 the approximation
is quite good.

From the above table it is evident that the probability, when N = 5, of a
value of $r_K$ numerically as high as 0.8, when X and Y are independent, is
10/120 = 0.083, so that even a value as high as this is not significant at the 5%
level. For a sample of 10, the standard deviation of S is $(125)^{1/2} = 11.2$, so that
a significant value at the 5% level, obtained from the normal approximation,
would be 1.96 × 11.2 = 22.0. This corresponds to a value of $r_K = \pm 0.49$. The
exact probability for |S| ≥ 23 is 0.046, a little under 5%.

In estimating the significance of an observed S we should make a correction
for continuity similar to that made in replacing the binomial distribution by a
normal distribution. Since the distribution of S is discrete, successive values
differing by 2 units, the observed S should be replaced by S − 1 if S is positive
or by S + 1 if it is negative.
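
For small N the null distribution of S can be generated by listing every permutation. The sketch below does this for N = 5 and also codes the normal approximation with the continuity correction just described; the function names are hypothetical, and complete enumeration is practical only for quite small N.

```python
import math
from collections import Counter
from itertools import combinations, permutations

def kendall_score(perm):
    """Score S of a Y-ranking when the X-ranks are in natural order."""
    return sum(1 if perm[j] > perm[i] else -1
               for i, j in combinations(range(len(perm)), 2))

def null_distribution(n):
    """Frequency of each value of S over all n! equally likely rankings."""
    return Counter(kendall_score(p) for p in permutations(range(1, n + 1)))

def normal_variance(n):
    return n * (n - 1) * (2 * n + 5) / 18.0

def normal_deviate(s, n):
    """Standardized S with the continuity correction (S - 1 or S + 1)."""
    corrected = s - 1 if s > 0 else s + 1 if s < 0 else 0
    return corrected / math.sqrt(normal_variance(n))

dist5 = null_distribution(5)
total = sum(dist5.values())
p_two_sided = sum(f for s, f in dist5.items() if abs(s) >= 8) / total
exact_variance = sum(s * s * f for s, f in dist5.items()) / total
```

The frequencies agree with Table 11.6, and the variance of the enumerated distribution equals N(N − 1)(2N + 5)/18 exactly, even for N as small as 5.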
Spearman's coefficient $r_S$ also tends to normality as N increases, but more
slowly than $r_K$. Whether there are ties or not, the variance of $\sum d^2$ is given by

(11.17.1)  $V\!\left(\sum d^2\right) = \left(\dfrac{N^3 - N}{6}\right)^2 \dfrac{1}{N - 1}$

and that of $r_S$ by

(11.17.2)  $V(r_S) = \dfrac{1}{N - 1}$

When N > 20 the distribution may be taken as approximately normal.

Down to somewhat lower values of N (say 10) the distribution of
$r_S[(N - 2)/(1 - r_S^2)]^{1/2}$ is approximately that of Student's t with N − 2 degrees
of freedom. This is the same as the distribution which was shown earlier to
hold exactly (on certain assumptions) for Pearson's coefficient in a sample from
an uncorrelated parent population.

11.18  Contingency  Tables  Often  in  medical,  biological,  psychological  or 
econometric  research  we  encounter  characteristics  or  attributes  which  we  cannot 
measure  accurately  and  according  to  which  we  may  not  even  be  able  to  rank  the 
individuals  of  a  sample,  but  which  do  permit  us  to  divide  the  sample  into 
classes  and  count  the  numbers  in  each  class.  We  might,  for  example,  classify 
a  sample  of  women  students  by  the  color  of  their  hair,  as  "fair-haired,"  "red- 
haired,"  "brown-haired"  or  "black-haired,"  or  a  sample  of  housewives  by  their 
place  of  residence  as  "rural"  or  "urban." 

A frequency table in which a sample is classified according to two different
attributes (whether quantitative or not) is called a contingency table. It looks
rather like a correlation table, except that the columns and rows do not
necessarily correspond to any numerical values of the attributes X and Y. If a sample
of N is divided into s X-classes (denoted by $X_1, X_2, \ldots, X_s$) and into t Y-classes
(denoted by $Y_1, Y_2, \ldots, Y_t$), the frequency $f_{ij}$ of individuals falling into class $X_i$
and also into class $Y_j$ is entered in the ith row and jth column of the table.
Thus, for s = 4 and t = 3,

Table 11.7

                  Y
           Y_1    Y_2    Y_3
     X_1   f_11   f_12   f_13   r_1
X    X_2   f_21   f_22   f_23   r_2
     X_3   f_31   f_32   f_33   r_3
     X_4   f_41   f_42   f_43   r_4
           c_1    c_2    c_3    N
The marginal total for the ith row is $r_i = \sum_j f_{ij}$ and that for the jth column is
$c_j = \sum_i f_{ij}$. The grand total is $N = \sum r_i = \sum c_j$.

We may assume that in the population there is a probability $\pi_{ij}$ that an
individual selected at random will fall in classes $X_i$ and $Y_j$. The relative
frequency $f_{ij}/N$ will be an approximation to $\pi_{ij}$. If X and Y are independent we
shall have

(11.18.1)  $\pi_{ij} = \pi_i \cdot \pi_j$

where $\pi_i$ (= $\sum_j \pi_{ij}$) is the probability of $X_i$ regardless of the Y classes, and
$\pi_j$ (= $\sum_i \pi_{ij}$) is the probability of $Y_j$ regardless of the X classes. We can define
the mean square contingency, which is a measure of the degree of association
between X and Y in the population, by

(11.18.2)  $\phi^2 = \sum_{i,j} \dfrac{\pi_{ij}^2}{\pi_i \pi_j} - 1$

This is 0 if and only if X and Y are independent. Its greatest possible value is
q − 1 where q is the smaller of the numbers s and t (or their common value if
they are equal). The quantity $\phi^2/(q - 1)$ may therefore be used as a measure of
the degree of association, and, like $r^2$, it varies between 0 and 1.

The expected frequency in the i,jth cell of the contingency table is $N\pi_{ij}$, and
the deviation of the table from expectation can be measured by calculating the
quantity

(11.18.3)  $\chi_s^2 = \sum \dfrac{(f_{ij} - N\pi_{ij})^2}{N\pi_{ij}}$

the sum being extended over all the cells of the table. Since we usually wish to
test the hypothesis that X and Y are independent, we can replace $\pi_{ij}$ by $\pi_i \pi_j$, but
these marginal probabilities are unknown, and must be estimated from the
sample. It is natural to estimate $\pi_i$ and $\pi_j$ by the relative marginal frequencies
$r_i/N$ and $c_j/N$ respectively, and in fact these are the estimators given by the
method of maximum likelihood.

The likelihood of the observed sample of N picked from the assumed
population is given by

(11.18.4)  $L = \prod_{i,j} (\pi_i \pi_j)^{f_{ij}}$

where the $\pi_i$ and $\pi_j$ are subject to the restrictions $\sum \pi_i = 1$, $\sum \pi_j = 1$. Using the
method of Lagrange multipliers (Appendix A.15) and maximizing
$\log L - \lambda \sum \pi_i - \mu \sum \pi_j$, we obtain the relations

(11.18.5)  $\dfrac{r_i}{\hat\pi_i} - \lambda = 0, \qquad \dfrac{c_j}{\hat\pi_j} - \mu = 0$

It follows, since $\sum r_i = \sum c_j = N$, that $r_i = N\hat\pi_i$ and $c_j = N\hat\pi_j$.


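Using these estimators, we can take the expected frequency in the i,jth cell as

(11.18.6)  $\phi_{ij} = N \cdot \dfrac{r_i}{N} \cdot \dfrac{c_j}{N} = \dfrac{r_i c_j}{N}$

Therefore,

(11.18.7)  $\chi_s^2 = \sum \dfrac{(f_{ij} - \phi_{ij})^2}{\phi_{ij}} = \sum \dfrac{f_{ij}^2}{\phi_{ij}} - 2\sum f_{ij} + \sum \phi_{ij} = \sum \dfrac{f_{ij}^2}{\phi_{ij}} - N = N\left[\sum \dfrac{f_{ij}^2}{r_i c_j} - 1\right]$
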
It is shown in Appendix A.17 that this quantity has approximately the $\chi^2$ distribution (discussed in §4.6) with $(s - 1)(t - 1)$ degrees of freedom. Since the population probabilities $\pi_i$ and $\pi_j$ are estimated from the marginal frequencies $r_i$ and $c_j$, we must assume that in all samples these marginal frequencies remain constant. The observed frequencies $f_{ij}$ in the contingency table can therefore be varied only in $(s - 1)(t - 1)$ of the cells, and when these are filled the frequencies in the remaining cells are automatically determined. The distribution of the N observations among the cells is then multinomial (Appendix A.16), subject to the restriction just mentioned.

Since the $\chi^2$ distribution applies strictly only in the limiting case, as the expected frequencies increase indefinitely, the approximation should not be used if some of these frequencies are very small. Some investigations (see [10]) suggest that a minimum value of $\phi_{ij}$ as low as 1 may be tolerated if values below 5 do not occur in more than about 20% of the cells.

If  a  larger  proportion  of  the  cells  have  expected  frequencies  below  5,  it  is 
wise  to  use  an  exact  method  [11]. 

Example 10  Table 11.8 gives some results obtained by Woo (Biometrika, 1928) on the association between "left-handedness" and "left-eyedness." The X-categories are left-handed, ambidextrous and right-handed, the Y-categories left-eyed, ambiocular, and right-eyed.

Table 11.8

          Y1     Y2     Y3     Total
X1        34     62     28      124
X2        27     28     20       75
X3        57    105     52      214
Total    118    195    100      413

Here  $\chi_s^2 = 413[(34)^2/(124 \times 118) + (62)^2/(124 \times 195) + \ldots + (52)^2/(214 \times 100) - 1] = 4.02$.

The number of degrees of freedom is 2 x 2 = 4, so that this value of $\chi^2$ is certainly not significant. The hypothesis that there is no association between the attributes A and B is not rejected.

The sample statistic which corresponds to the mean square contingency $\phi^2$ defined in Eq. (11.18.2) is

(11.18.8)  $f^2 = \sum_{i,j} \frac{f_{ij}^2}{r_i c_j} - 1 = \frac{\chi_s^2}{N}$

This, divided by $q - 1$ (q being the lesser of s and t), is a measure of the degree of association indicated by the sample. The upper limit 1 is attained if (for s < t) each column contains just one non-zero frequency, or (for s > t) if each row contains just one non-zero frequency. The quantity $C = [f^2/(q - 1)]^{1/2}$ is a coefficient of contingency.

In Example 10 above, q = 3 and $f^2/(q - 1) = 4.02/826 = 0.0049$, so that C = 0.07. There is very little association between X and Y.
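The arithmetic of Example 10 can be checked with a few lines of code. The following is a minimal sketch, assuming Python; the function name `contingency_stats` and the variable names are illustrative choices and do not come from the text. It applies Eqs. (11.18.7) and (11.18.8) directly to the cell counts of Table 11.8.

```python
import math

def contingency_stats(table):
    """Return (chi_s^2, f^2, C, d.f.) for an s x t table of cell counts."""
    rows = [sum(r) for r in table]           # marginal totals r_i
    cols = [sum(c) for c in zip(*table)]     # marginal totals c_j
    n = sum(rows)                            # total frequency N
    # Eq. (11.18.8): f^2 = sum of f_ij^2 / (r_i c_j), minus 1
    f2 = sum(f * f / (rows[i] * cols[j])
             for i, row in enumerate(table)
             for j, f in enumerate(row)) - 1.0
    chi2 = n * f2                            # Eq. (11.18.7)
    q = min(len(rows), len(cols))            # lesser of s and t
    c = math.sqrt(f2 / (q - 1))              # coefficient of contingency
    dof = (len(rows) - 1) * (len(cols) - 1)
    return chi2, f2, c, dof

# Table 11.8 (rows X1..X3, columns Y1..Y3)
table_11_8 = [[34, 62, 28], [27, 28, 20], [57, 105, 52]]
chi2, f2, c, dof = contingency_stats(table_11_8)
print(f"chi_s^2 = {chi2:.2f}, C = {c:.2f}, d.f. = {dof}")
```

The sketch deliberately works from $f^2$ rather than from the expected frequencies, since that is the form used in the worked example.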

11.19  The Contingency Table with Two Rows or Two Columns   For a 2 x n table the calculation of $\chi_s^2$ may be somewhat simplified. As shown by Brandt and Snedecor, the value for a sample, with frequencies as given in the following table, is:

(11.19.1)  $\chi_s^2 = \frac{N^2}{r_1 r_2} \left( \sum_j \frac{a_j^2}{c_j} - \frac{r_1^2}{N} \right)$

where $c_j = a_j + b_j$.

Table 11.9

          Y1    Y2    Y3    ...    Yn    Total
X1        a1    a2    a3    ...    an     r1
X2        b1    b2    b3    ...    bn     r2
Total     c1    c2    c3    ...    cn     N

Either row in the table can, of course, be chosen as the $a_j$.

Example 11 (Lindstrom)  The variable X is the presence (or absence) of a sugar-producing gene in ears of corn, Y is the number of rows of kernels in the ear.

Table 11.10

                 No. of Rows of Kernels
               8     10     12     14    Total
Present       18     37     27      0       82
Absent        15     26     43      4       88
Total         33     63     70      4      170

Since the numbers in the last column (Y = 14) are so small, it is better to group the last two columns together. We then have a 2 x 3 table, and Eq. (1) gives

$\chi_s^2 = \frac{(170)^2}{(82)(88)} \left[ \frac{18^2}{33} + \frac{37^2}{63} + \frac{27^2}{74} - \frac{82^2}{170} \right] = 7.39$


With 2 d.f., the probability of a value as large as this is about 0.025, so that association between presence of the sugar gene and few rows of kernels is definitely indicated, although not strongly so.

The  value  of  the  contingency  coefficient  C  is  0.21,  which  is  fairly  high  for 
this  coefficient. 
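Example 11 can be reproduced with a short computation. This is a sketch only, assuming Python; `brandt_snedecor` is a hypothetical helper name. It evaluates Eq. (11.19.1) for the grouped 2 x 3 table and uses the fact that, for 2 degrees of freedom only, the upper tail probability of $\chi^2$ is $e^{-\chi^2/2}$.

```python
import math

def brandt_snedecor(a, b):
    """chi_s^2 for a 2 x n table whose rows are the lists a and b, Eq. (11.19.1)."""
    r1, r2 = sum(a), sum(b)
    n = r1 + r2
    # sum over columns of a_j^2 / c_j, where c_j = a_j + b_j
    s = sum(aj * aj / (aj + bj) for aj, bj in zip(a, b))
    return n * n / (r1 * r2) * (s - r1 * r1 / n)

# Table 11.10 with the last two columns (12 and 14 rows) grouped together
present = [18, 37, 27 + 0]
absent = [15, 26, 43 + 4]

chi2 = brandt_snedecor(present, absent)
p_value = math.exp(-chi2 / 2)                       # valid for 2 d.f. only
c = math.sqrt(chi2 / (sum(present) + sum(absent)))  # q - 1 = 1 for a 2 x n table
print(f"chi_s^2 = {chi2:.2f}, P = {p_value:.3f}, C = {c:.2f}")
```

Interchanging the two argument lists leaves the result unchanged, in line with the remark that either row may be taken as the $a_j$.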

With a 2 x 2 table the calculation of $\chi_s^2$ is still simpler. If the frequencies in the four cells are denoted by a, b, c, d, the value of $\chi_s^2$ is given by

$\frac{\chi_s^2}{N} = \frac{a^2/r_1 + c^2/r_2}{c_1} + \frac{b^2/r_1 + d^2/r_2}{c_2} - 1$

          Y1    Y2    Total
X1         a     b     r1
X2         c     d     r2
Total     c1    c2     N

On substituting a + b for $r_1$, etc., and carrying out some algebraic manipulations, this becomes

(11.19.2)  $\chi_s^2 = \frac{N(ad - bc)^2}{r_1 r_2 c_1 c_2}$
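The algebra leading to Eq. (11.19.2) can be shortened. The following sketch takes a different route from the direct substitution mentioned above, and is offered only as one possible derivation: it uses the fact that in a 2 x 2 table every cell differs from its expected frequency $\phi_{ij} = r_i c_j/N$ by the same amount, apart from sign.

```latex
\begin{align*}
a - \frac{r_1 c_1}{N}
  &= \frac{a(a+b+c+d) - (a+b)(a+c)}{N} = \frac{ad - bc}{N}, \\
\chi_s^2
  &= \sum_{i,j} \frac{(f_{ij} - \phi_{ij})^2}{\phi_{ij}}
   = \frac{(ad - bc)^2}{N^2} \sum_{i,j} \frac{N}{r_i c_j} \\
  &= \frac{(ad - bc)^2}{N}
     \left(\frac{1}{r_1} + \frac{1}{r_2}\right)
     \left(\frac{1}{c_1} + \frac{1}{c_2}\right)
   = \frac{(ad - bc)^2}{N} \cdot \frac{N}{r_1 r_2} \cdot \frac{N}{c_1 c_2}
   = \frac{N (ad - bc)^2}{r_1 r_2 c_1 c_2}.
\end{align*}
```

The other three cells give $\pm (ad - bc)/N$ by the same calculation, and the last line uses $r_1 + r_2 = c_1 + c_2 = N$.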


Example  12  A  drug  said  to  prevent  sea-sickness  was  tested  as  follows: 
25  men  were  given  the  drug  and  25  others  were  not.  Both  groups  were  tested 
in  a  rocking  machine  and  the  numbers  who  became  sick  were  noted.  The 
results  were  as  shown : 
                Sick    Not Sick    Total
With drug        10        15         25
Without drug     19         6         25
Total            29        21         50

$\chi_s^2 = \frac{50 (60 - 285)^2}{25 \cdot 25 \cdot 29 \cdot 21} = 6.65$


The  probability  of  a  value  as  high  as  this  with  1  d.f.  is  a  little  less  than  0.01,  so 
that  an  association  between  the  use  of  the  drug  and  immunity  from  sickness  is 
pretty  definitely  indicated.  The  coefficient  of  contingency  is  0.37. 
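A short numerical check of Example 12 follows. It is a sketch, assuming Python; the function names are illustrative. The tail probability for 1 degree of freedom is obtained from the normal law, $P(\chi^2 > x) = P(|Z| > \sqrt{x})$, which the standard library exposes through the complementary error function.

```python
import math

def chi2_2x2(a, b, c, d):
    """chi_s^2 for the 2 x 2 table [[a, b], [c, d]], Eq. (11.19.2)."""
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    n = r1 + r2
    return n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)

def p_one_df(chi2):
    """Upper tail of chi-square with 1 d.f.: P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2))."""
    return math.erfc(math.sqrt(chi2 / 2))

# Example 12: with drug (10 sick, 15 not), without drug (19 sick, 6 not)
chi2 = chi2_2x2(10, 15, 19, 6)
p = p_one_df(chi2)
c = math.sqrt(chi2 / 50)   # coefficient of contingency, q - 1 = 1
print(f"chi_s^2 = {chi2:.2f}, P = {p:.4f}, C = {c:.2f}")
```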

Note that with one degree of freedom, $\chi_s$ (the square root of $\chi_s^2$) is approximately distributed as a standard normal variate taken without regard to sign. The probability can therefore be found from a table of the normal law, using both tails.

11.20  The Yates Correction for Continuity   The cell-frequencies in a contingency table are necessarily integers, so that $\chi_s^2$ is a discrete variate, whereas $\chi^2$ varies continuously. The situation is something like that in approximating a binomial distribution by a continuous normal distribution, where the sum of terms from x = a to x = b (inclusive) is approximated by an integral from $a - \tfrac{1}{2}$ to $b + \tfrac{1}{2}$ (see §3.11).

In the 2 x 2 table, as pointed out by Yates, the approximation to $\chi^2$ is much improved by replacing d by $d - \tfrac{1}{2}$ or $d + \tfrac{1}{2}$, according as ad > bc or ad < bc, and adjusting the other frequencies so as to keep the marginal totals constant. The effect of this is to replace $(ad - bc)^2$ in Eq. (11.19.2) by $(|ad - bc| - N/2)^2$, and thereby to reduce somewhat the apparent significance of the result.

In Example 12, above, the rearranged table would be as shown. Since $[(10\tfrac{1}{2})(6\tfrac{1}{2}) - (14\tfrac{1}{2})(18\tfrac{1}{2})]^2 = (200)^2 = (225 - 25)^2$, the value of $\chi^2$, with the correction, is reduced to 5.25, the probability for which is more than 0.02.

10½    14½    25
18½     6½    25
29      21    50

If the total frequency N < 20, or if 20 < N < 40 and the smallest expected frequency is less than 5, it is better to use Fisher's exact method, given in the next section.
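The corrected statistic and the rule of thumb just stated can be sketched in code. This is an illustration only, assuming Python; `chi2_yates` and `prefer_exact` are hypothetical names, and the treatment of the boundary case N = 20, which the rule does not address, is an assumption of this sketch.

```python
import math

def chi2_yates(a, b, c, d):
    """Eq. (11.19.2) with (ad - bc)^2 replaced by (|ad - bc| - N/2)^2."""
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    n = r1 + r2
    return n * (abs(a * d - b * c) - n / 2) ** 2 / (r1 * r2 * c1 * c2)

def prefer_exact(a, b, c, d):
    """Rule of thumb: use Fisher's method if N < 20, or if N < 40 and the
    smallest expected frequency is below 5 (N = 20 is put in the second
    case here; that choice is an assumption)."""
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    n = r1 + r2
    smallest_expected = min(r1, r2) * min(c1, c2) / n
    return n < 20 or (n < 40 and smallest_expected < 5)

corrected = chi2_yates(10, 15, 19, 6)          # Example 12
p = math.erfc(math.sqrt(corrected / 2))        # 1 d.f., both tails of the normal law
print(f"corrected chi^2 = {corrected:.2f}, P = {p:.3f}")
print("use exact method:", prefer_exact(10, 15, 19, 6))
```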

* 11.21  Fisher's Exact Method for 2 x 2 Tables   Fisher has pointed out that exact probabilities can be calculated for all the possible 2 x 2 tables which have the same set of marginal frequencies. Let the observed cell-frequencies be a, b, c, d, and suppose the table so arranged that d is the smallest of these. The distribution of the N items in the sample among the four cells (on the hypothesis that there is no association between X and Y) is a hypergeometric one (see §3.5). It has the following mathematical model: Given N balls in an urn, of which $r_1$ are black (corresponding to $X_1$) and $r_2$ are white (corresponding to $X_2$), and given N boxes of which $c_1$ are red (for $Y_1$) and $c_2$ are green (for $Y_2$), withdraw the balls one at a time and place them at random in the boxes, one ball to a box. The number of black balls in red boxes will be a, and similarly for the other frequencies.

The probability that there are just b black balls and d white ones in the $c_2$ green boxes is $\binom{r_1}{b}\binom{r_2}{d} \big/ \binom{N}{c_2}$, since the numerator is the number of ways of choosing b black balls out of $r_1$ and d white balls out of $r_2$, while the denominator is the total number of ways of picking $c_2$ balls out of the urn. But once the green boxes have been filled, the numbers a and c for the red boxes are fixed, since $a = r_1 - b$ and $c = r_2 - d$. The probability of the whole observed set of frequencies a, b, c, d is, therefore,

(11.21.1)  $p(d) = \frac{\binom{r_1}{b}\binom{r_2}{d}}{\binom{N}{c_2}} = \frac{r_1!\, r_2!\, c_1!\, c_2!}{a!\, b!\, c!\, d!\, N!}$


Now the theoretical frequency $\delta$, corresponding to d, is $r_2 c_2/N$. If $d < \delta$, all smaller values of d down to zero will be even less likely than d itself. We can therefore calculate the probability P of the observed distribution and of all less likely ones in one direction, given by

(11.21.2)  $P = \sum_{d=0}^{d_1} p(d)$

where $d_1$ is the observed d. (The value of d determines the whole table, the marginal frequencies being fixed.) If $d_1 > \delta$, the sum will go from $d = d_1$ up to $d = c_2$, since we are now concerned with values of d larger than the expected one. The P so calculated corresponds to one tail of the distribution, whereas the $\chi^2$ method takes account of both tails. It is to be expected, therefore, that the probability calculated from $\chi_s^2$ will be an approximation to 2P and not to P.

In Example 12 above, d = 6 and $\delta = 10.5$, so that

$P = \sum_{d=0}^{6} p(d)$

The values of p(d) are given in the following table:

d       6          5          4          3          2, 1, 0
p(d)    0.00860    0.00161    0.00020    0.00002    0.00000

and we find that P = 0.0104. The $\chi^2$ value, with Yates's correction, gives P = 0.022, which is quite close to 2P.
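The exact probabilities for Example 12 are easily recomputed. The sketch below assumes Python 3.8 or later (for `math.comb`); the function names are illustrative, not the text's. It evaluates Eq. (11.21.1) for each admissible d and sums one tail as in Eq. (11.21.2).

```python
from math import comb

def table_probability(r1, r2, c2, d):
    """Eq. (11.21.1): probability of the 2 x 2 table with fixed margins and cell d."""
    b = c2 - d                      # black balls in the green boxes
    if b < 0 or b > r1 or d < 0 or d > r2:
        return 0.0                  # impossible table for these margins
    return comb(r1, b) * comb(r2, d) / comb(r1 + r2, c2)

def fisher_one_tail(a, b, c, d):
    """One-tailed P of Eq. (11.21.2) for the table [[a, b], [c, d]]."""
    r1, r2, c2 = a + b, c + d, b + d
    delta = r2 * c2 / (r1 + r2)     # theoretical frequency corresponding to d
    if d <= delta:
        values = range(0, d + 1)
    else:
        values = range(d, min(r2, c2) + 1)
    return sum(table_probability(r1, r2, c2, x) for x in values)

# Example 12: a = 10, b = 15, c = 19, d = 6
p6 = table_probability(25, 25, 21, 6)
P = fisher_one_tail(10, 15, 19, 6)
print(f"p(6) = {p6:.5f}, P = {P:.4f}, 2P = {2 * P:.4f}")
```

Exact integer binomial coefficients are used throughout, so rounding enters only in the final division.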

The  chief  objection  to  Fisher's  method  is  the  considerable  amount  of 
computation  usually  involved.  Tables  recently  published  by  Mainland  and 
others  [12]  enable  the  significance  (at  5%  and  1  %  levels)  to  be  estimated  very 
quickly,  without  calculation. 

An alternative procedure is to use a normal approximation with a continuity correction. The variance of d, as given by Eq. (3.5.7), is

(11.21.3)  $V(d) = \frac{N - c_2}{N - 1} \cdot c_2 \cdot \frac{r_2}{N}\left(1 - \frac{r_2}{N}\right) = \frac{c_1 c_2 r_1 r_2}{N^2 (N - 1)}$

and if

(11.21.4)  $z = \frac{|d - \delta| - \tfrac{1}{2}}{[V(d)]^{1/2}}$

is treated as a normal variate, the probability of a value at least as great as this can be found.

Thus, in Example 12, $|d - \delta| - \tfrac{1}{2} = 4$, V(d) = 3.11, so that z = 4/1.76 = 2.27, giving P = .0116. This is fairly close to the exact value.
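The normal approximation can be checked in the same way. This is a minimal sketch, assuming Python; the function name is hypothetical. It evaluates Eqs. (11.21.3) and (11.21.4) and takes the one-tailed normal probability from the complementary error function.

```python
import math

def normal_approx_one_tail(a, b, c, d):
    """Return (V(d), z, P) from Eqs. (11.21.3) and (11.21.4)."""
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    n = r1 + r2
    delta = r2 * c2 / n                              # expected value of d
    var = c1 * c2 * r1 * r2 / (n * n * (n - 1))      # hypergeometric variance
    z = (abs(d - delta) - 0.5) / math.sqrt(var)      # with continuity correction
    p = 0.5 * math.erfc(z / math.sqrt(2))            # one tail of the normal law
    return var, z, p

var, z, p = normal_approx_one_tail(10, 15, 19, 6)    # Example 12
print(f"V(d) = {var:.2f}, z = {z:.2f}, P = {p:.4f}")
```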


11.22  The Chi-Square Test as a Test of Homogeneity   It sometimes happens that a table which looks like a contingency table really reflects a different situation. The rows of the table each represent a different set of observations, $r_i$ in number, the individuals in each set being classified according to the attribute Y. The numbers $r_i$ are selected arbitrarily and do not depend at all on the population. The hypothesis to be tested is that each sample (represented by a row of the table) comes from the same population in which the probability of attribute $Y_j$ is $\pi_j$ (with $\sum \pi_j = 1$).

The value of $\pi_j$ is estimated as before by $c_j/N$. It may be shown that the limiting distribution of $\chi_s^2$, calculated in the ordinary way, is still the $\chi^2$ distribution with $(s - 1)(t - 1)$ degrees of freedom.

Example  13  In  order  to  see  if  the  age-distribution  of  whitefish  in  Lake 
Wabamun,  Alberta,  had  changed  significantly  between  1957  and  1958,  samples 
of  the  catches  in  these  two  years  were  classified  in  age-groups  as  follows : 

Table 11.12

                       Age (Years)
Year     3-4     5     6     7     8    >9    Total
1957       6    15    10    38    62    26      157
1958      16    12     9    22    36     5      100
Total     22    27    19    60    98    31      257

Here

$\chi_s^2 = \frac{(257)^2}{(100)(157)} \left( \frac{16^2}{22} + \frac{12^2}{27} + \ldots + \frac{5^2}{31} - \frac{100^2}{257} \right) = 18.6$

The number of degrees of freedom is 5, so that this value of $\chi^2$ is highly significant. The value of $f^2$ is 0.072, giving C = 0.27.
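The homogeneity test of Example 13 can also be carried out from the expected frequencies themselves, which makes it easy to see whether any of them is too small for the $\chi^2$ approximation. The following is a sketch, assuming Python; the helper names are illustrative.

```python
def expected_frequencies(table):
    """phi_ij = r_i c_j / N for every cell, as in Eq. (11.18.6)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return [[ri * cj / n for cj in cols] for ri in rows]

def chi2_from_expected(table, expected):
    """Sum over all cells of (f_ij - phi_ij)^2 / phi_ij."""
    return sum((f - e) ** 2 / e
               for f_row, e_row in zip(table, expected)
               for f, e in zip(f_row, e_row))

# Table 11.12: whitefish catches by age group, 1957 and 1958
catches = [[6, 15, 10, 38, 62, 26],
           [16, 12, 9, 22, 36, 5]]

expected = expected_frequencies(catches)
chi2 = chi2_from_expected(catches, expected)
dof = (len(catches) - 1) * (len(catches[0]) - 1)
smallest = min(min(row) for row in expected)
print(f"chi_s^2 = {chi2:.1f} with {dof} d.f.; smallest expected frequency = {smallest:.2f}")
```

The statistic agrees with the Brandt-Snedecor form used in the example, since both are rearrangements of the same sum.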

PROBLEMS 
A  (§§11.1-11.2) 

1.  If the joint probability density for X and Y is $f(x, y) = 2/a^2$, 0 < x < y, 0 < y < a, find (a) the marginal probability densities g(x) and h(y), (b) the regression equations of Y on X and of X on Y, (c) the means and variances of X and Y, the covariance, and the coefficient of correlation between X and Y. Hint: f(x, y) is constant over a triangular area in the xy plane.

2.  Is  it  true  that  a  necessary  and  sufficient  condition  for  two  variates  X  and  Y  to 
have  a  bivariate  normal  distribution  is  that  the  two  regression  equations  are  linear? 
Hint:  See  Problem  1  above. 

3.  For  the  bivariate  normal  distribution,  Eq.  (1 1.2.5),  show  that  the  variance  of  rjz 
is  equal  to  p2  and  that  the  correlation  coefficient  between  i)Z  and  v  is  equal  to  p. 

4.  Show that the coefficient of correlation for two variates X and Y is the geometric mean of the slopes of the two regression lines, one reckoned from the X-axis and one from the Y-axis. (The geometric mean of a and b is $\sqrt{ab}$.)

5.  Show  that  if  X  and  Y  are  independent  variates  they  are  necessarily  uncorrelated. 
(The  condition  for  independence  implies  that  E(XY)  =  E(X)E(Y).) 

6.  If X is uniformly distributed on (−1, 1) and if $Y = X^2$, show that X and Y are uncorrelated. (Note that X and Y are certainly not independent.)

7.  Prove that the acute angle between the two lines of regression is given by

$\tan \theta = \frac{1 - \rho^2}{\rho} \cdot \frac{\sigma_X \sigma_Y}{\sigma_X^2 + \sigma_Y^2}$

8.  Let the variate X have the marginal distribution $g(x) = 1$, $-\tfrac{1}{2} < x < \tfrac{1}{2}$, and let the conditional density of Y, given X = x, be $f(y|x) = 1$, $x < y < x + 1$, $-\tfrac{1}{2} < x < 0$; $f(y|x) = 1$, $-x < y < 1 - x$, $0 < x < \tfrac{1}{2}$; and $f(y|x) = 0$ otherwise. Prove that X and Y are uncorrelated.

9.  If $X_1$, $X_2$, $X_3$ are uncorrelated variates, each with the same standard deviation $\sigma$, find the coefficient of correlation between $X_1 + X_2$ and $X_2 + X_3$.

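10.  If X and Y are uncorrelated, with means zero and variances $\sigma_X^2$, $\sigma_Y^2$, show that the variates $U = X \cos \alpha + Y \sin \alpha$ and $V = X \sin \alpha - Y \cos \alpha$ have a correlation coefficient

$\rho_{UV} = \frac{\sigma_X^2 - \sigma_Y^2}{[(\sigma_X^2 - \sigma_Y^2)^2 + 4\sigma_X^2 \sigma_Y^2 \operatorname{cosec}^2 2\alpha]^{1/2}}$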
B  (§§11.3-11.6) 

1.  The following data represent the ages of husband (X) and wife (Y) for 20 couples selected at random from a certain population.

X   22  24  26  26  27  27  28  28  29  30  30  30  31  32  33  34  35  35  36  37
Y   18  20  20  24  22  24  27  24  21  25  29  32  27  27  30  27  30  31  30  32

Find the equations of the two regression lines. Make a scatter diagram for the data and draw the two lines on it.
2.  Calculate the coefficient of correlation of X and Y from the data of Problem B.1. On the assumption that the population is bivariate normal, find 95% confidence limits for the two regression coefficients, $\beta$ (for the first regression line) and $\beta'$ (for the second).

3.  In studying a set of pairs of values of related variates X and Y, a statistician has computed the following quantities: N = 100, $\sum x = 12{,}500$, $\sum y = 8{,}000$, $\sum x^2 = 1{,}585{,}000$, $\sum y^2 = 648{,}100$, $\sum xy = 1{,}007{,}425$. Calculate $\bar{x}$, $\bar{y}$, $s_X$, $s_Y$, $s_{XY}$ and r for these data.

4.  In  the  following  table,  X  is  the  weight  (to  nearest  half  pound)  and  Y  the  height 
(to  nearest  tenth  of  an  inch)  for  200  freshmen  at  a  university. 


\  X 

90 
-99.5 

100 
-109.5 

110 

120 

130 

140 

150 

160 

170 

180 

190 

200 
-209.5 

76-77.9 

1 

74-75.9 

1 

1 

1 

1 

72- 

1 

1 

4 

1 

70- 

1 

2 

6 

7 

6 

2 

1 

2 

1 

1 

68- 

2 

8 

17 

8 

9 

2 

1 

1 

1 

66- 

8 

16 

14 

13 

6 

2 

1 

1 

64- 

3  ' 

8 

7 

7 

3 

3 

1 

1 

62- 

1 

4 

1 

7 

1 

60- 

58-59.9 

1 

Find the regression equation of height on weight, and give 95% confidence limits for the regression coefficient $\beta$. Calculate the correlation coefficient between height and weight.

5.  A  coefficient  of  correlation  calculated  from  a  sample  of  size  25  is  found  to  be 
0.37.  Is  this  value  significantly  different  (at  the  5%  level)  from  zero? 

6.  A sample correlation coefficient of 0.561 is said to be highly significant. Assuming that this means that the probability of getting a value numerically as great is less than 0.01, what is the smallest sample size that would warrant the statement? Hint: $r[(N - 2)/(1 - r^2)]^{1/2}$ is to be the same as $t_{0.01}$ for N − 2 d.f. Find N − 2 by trial, using the table of t.

7.  Is  it  true  that  a  correlation  coefficient  of  r  =  0.6  indicates  a  relationship  twice 
as  close  as  that  indicated  by  r  =  0.3  ?  Hint:  Consider  the  relative  accuracy  of  estimation 
of  Y  from  a  given  X  in  the  two  cases,  as  measured  by  the  reciprocal  of  the  standard  error 
of  estimate. 

8.  The marks of a class of 12 students on a mid-term test (x) and on the final examination (y) were:

x   41   45   50   68   47   77   90   100   80   100   40   43
y   60   63   60   48   85   56   53    91   74    98   65   43

What is the regression estimate of the final mark of a student who obtained 60 on the test but was ill at the time of the final examination? What is the standard error of this estimate?

9.  The two regression lines for variates X and Y have been computed as 4x − 5y + 33 = 0 and 20x − 9y − 107 = 0. Given that the variance of X is 9, calculate the variance of Y, the means for X and Y and the coefficient of correlation between X and Y.

C  (§§11.7-11.11) 

1.  The  following  table  gives  death-rates  per  100,000  in  the  United  States  from 
typhoid  fever  for  the  years  from  1900-1920: 


Year   Rate     Year   Rate     Year   Rate
1900   31.3     1907   20.5     1914   10.8
1901   27.5     1908   19.6     1915    9.2
1902   26.3     1909   17.2     1916    8.8
1903   24.6     1910   18.0     1917    8.1
1904   23.9     1911   15.3     1918    7.0
1905   22.4     1912   13.2     1919    4.8
1906   22.0     1913   12.6     1920    5.0

Find the best-fitting straight line for these data. If the linear trend had continued, estimate the date at which typhoid fever would have been wiped out in the United States. Hint: Take the origin of X (the date) at 1910, so that the values of x are −10, −9, etc.

2.  In  the  following  table,  x  is  the  amount  of  irrigation  water  (inches)  applied  to 
an  experimental  farm  in  India  and  y  is  the  yield  of  rice  in  tons/acre  [13]. 


x   12     18     24     30     36     42     48
y   5.27   5.68   6.25   7.21   8.02   8.71   8.42

The  values  of  x  are  fixed  and  those  of  y  are  random.  Estimate  how  much  water  would 
be  necessary  for  a  yield  of  7.5  tons/acre  and  obtain  95%  confidence  limits  for  this 
estimate.  (Note:  An  approximation  for  moderately  large  N  to  the  standard  error  of 
estimate  for  x  is  given  by  sxe  =  sre/b.  This  is  much  simpler  than  solving  the  quadratic 
for  x  as  suggested  in  §  11.7.) 

3.  The following values were obtained in an experiment intended to show a linear functional relationship between X and Y. Both variables are subject to error and it is considered that the variance of the error in Y is 16 times that of the error in X. Obtain values for the estimators of $\alpha$ and $\beta$ in the linear relation $\eta = \alpha + \beta\xi$, and for a consistent estimator of $\sigma^2$, the variance of X.

x   1     2      3      4      5      6      7      8      9      10
y   9.9   13.2   16.4   19.7   22.5   26.1   29.2   32.5   35.7   38.8

4.  Find the best-fitting line for the data of Problem C.3 by the method of grouping, using (a) two groups of 5, (b) three groups of 3, 4, and 3 respectively. Find 90% confidence intervals for $\beta$ in both cases.

In method (b), obtain an estimator for $\sigma_u^2$, assuming that $\sigma_v^2 = 16\sigma_u^2$.

D  (§§11.12-11.13) 

1.  [14] Over a period of 20 years the mean wheat yield of eastern England was found to be correlated with the autumn rainfall, with r = −0.629. Is this significantly different from zero at the 1% level? Is it significantly different from −0.3?

2.  In a sample of 25 pairs of individuals (parent and child), the correlation in a certain character was found to be 0.60. Obtain 90% confidence limits for the population correlation coefficient. Could one conclude, with 90% confidence, that the true value (in the population sampled) was at least 0.40?

3.  One random sample of 28 from a certain bivariate population gave r = 0.60; another independent random sample of 23 gave r = 0.40. Is the difference significant at the 5% level? Hint: Use a two-tailed test for the difference, after making the Fisher transformation.

4.  For a sample of size 30 from a bivariate normal population, r was found to be 0.684. An independent sample of size 40 gave r = 0.719. What estimate would you suggest for the true value of $\rho$?

5.  Obtain an estimator of $\rho$ from the sampling distribution of r (Eqs. 11.12.7 and 11.12.8) by finding that value of $\rho$ for which $f(r, \rho)$ is a maximum, with a given r. Is this estimator unbiased? Hint: Put $\partial(\log f)/\partial\rho = 0$ and solve the quadratic equation for $\rho$ as far as terms of order 1/N. Neglect all terms in $S(\rho r)$ except the first.

E  (§§11.14-11.17) 

1.  (Garrett) Twelve salesmen were ranked in order of merit for efficiency (X) by their manager. The ranking (Y) in accordance with length of service is also given in the following table. What correlation is there between length of service and efficiency?

Salesman   A     B      C   D   E   F   G   H      J   K     L   M
X          6     12     1   9   8   5   2   10     3   1     4   11
Y          7.5   11.5   2   4   6   9   1   11.5   5   7.5   3   10

Calculate both Spearman's and Kendall's coefficient, correcting for the ties in the Y ranking.

2.  The scores of 10 students on two tests are given in the following table. Calculate the Pearson coefficient of correlation for the actual scores, and the Spearman coefficient for the ranks.

Student   A    B    C    D    E    F    G    H    J    K
X         92   89   87   86   83   77   71   62   53   40
Y         88   85   93   79   70   87   52   84   41   64

3.  If a sample of seven pairs is drawn from a population of values of independent variates X and Y, it is known that the computed Spearman coefficient will exceed 0.714 in not more than 5% of cases and will exceed 0.893 in not more than 1% of cases. What conclusion may be drawn regarding the judges in Example 8 of §11.14? Apply the Student-t approximation of §11.17 to the same problem.

4.  In a drama competition, ten plays were ranked independently by two adjudicators, as follows:

Play       A   B   C   D    E   F   G   H   J   K
Rank (X)   5   2   6   8    1   1   4   9   3   10
Rank (Y)   1   1   6   10   4   5   3   8   2   9

Calculate the coefficient of rank correlation by both the Spearman and Kendall formulas. Would you say that there is a significant measure of agreement between the two adjudicators? Hint: Use the normal approximation to the variance of S, and make the correction for continuity.
F  (§§ 11.18-11.22)

1.  In  the  accompanying  contingency  table,  X  represents  a  rating  given  to  each  of  a 
group  of  university  freshmen  on  the  basis  of  high  school  reports  and  Y  represents  the 
final  standing  in  degree  examinations  for  the  same  group.  Discuss  the  association 
between  these  two  attributes,  and  calculate  the  coefficient  of  contingency  C  defined 
in  §  11.18. 


Y \ X        Fair    Good    Excellent
3rd class     73      67        10
2nd class     64      84        15
1st class      5      24        28

2.  In  a  public  opinion  survey  the  following  questions  were  asked:  (1)  Do  you  drink 
beer?  (2)  Are  you  in  favour  of  local  option  on  the  sale  of  liquor?  In  one  district  the 
results  (excluding  those  who  had  no  opinions)  were  as  indicated. 


Drinkers 
Non-drinkers 


For  Local 
Option 


Against 


18 

45 


39 

37 


Does  this  provide  good  evidence  of  an  association  between  drinking  habits  and  opinion 
on  the  subject  of  local  option  ? 

3.  Two batches of 12 experimental animals, one batch inoculated and the other not inoculated, were exposed to infection under comparable conditions. Of the inoculated group, 2 died and 10 survived; of the other group 8 died and 4 survived. Does this observation provide evidence (at the 5% level of significance) of the value of the inoculation in increasing the chances of survival when exposed to infection? Hint: Calculate the probability of a result at least as extreme as that actually observed (a) by the $\chi^2$ test, using Yates's correction, and (b) by Fisher's exact method.

4.  In  a  certain  community  a  random  sample  of  50  men  and  50  women  over  21 
years  of  age  were  asked  about  their  educational  background,  classified  as  junior  high, 
senior  high  or  college.  The  results  were : 


          Junior High    Senior High    College
Male          13             25            12
Female        23             20             7

Is  there  a  significant  association  between  sex  and  educational  level?   Hint:  Use  the 
Brandt-Snedecor  formula. 

5.  Two  groups  of  freshmen  applying  to  enter  a  university  took  the  same  college 
aptitude  test.  The  groups  (A  and  B)  differed  in  the  type  of  high  school  education 
they  had  experienced.  The  frequency  distributions  of  scores  for  the  two  groups  were 
as  follows: 


Score     0-9   10-19   20-29   30-39   40-49   50-59   60-69   70-79   80-89   90-99
Group A    71     68      66      47      51      39      43      39      33      18
Group B    22      8      14      12       3      13       3      14      12      10

Calculate the value of $\chi^2$ and determine whether there is a significant difference in college aptitude between the groups.

6.  Prove that if ad < bc, then $\chi^2$ for the table

a + ½    b − ½
c − ½    d + ½

is given by $N(|ad - bc| - N/2)^2/(r_1 r_2 c_1 c_2)$ where $r_1 = a + b$, $r_2 = c + d$, $c_1 = a + c$, $c_2 = b + d$.

7.  It has been suggested by V. M. Dandekar (see [13], p. 388) that a better approximation than that given by Yates, to the true probability P for a 2 x 2 table, may be obtained by subtracting from the uncorrected value $\chi_0^2$ the term $(\chi_{-1}^2 - \chi_0^2)(\chi_0^2 - \chi_1^2)/(\chi_{-1}^2 - \chi_1^2)$ where $\chi_1^2$ and $\chi_{-1}^2$ are the values obtained by respectively increasing and decreasing the smallest frequency in the table by unity. Test this suggestion on the data of Problem F.3.

8.  Prove the Brandt-Snedecor formula for $\chi_s^2$, Eq. (11.19.1).
Chapter 12

REGRESSION ANALYSIS AND
CURVE FITTING

12.1  The Equations of Multiple Regression   In the last chapter we considered the relations between two variates X and Y. We now suppose that Y depends on p other variates which will be denoted by $X_1, X_2, \ldots, X_p$. These need not be independent, and in fact may all be powers of a single variate X. We shall call $X_1, \ldots, X_p$ the predictors and Y the predicted (or dependent) variate. The usual problem is to find the best linear predicting equation for Y (in the least squares sense) of the form:

(12.1.1)  $y_c = \sum_i b_i x_i, \quad i = 0, 1, 2, \ldots, p$

To avoid introducing a separate constant term, the first variate $X_0$ is a dummy which always takes the value 1. The coefficient $b_0$ is then the constant, denoted in Chapter 11 by a, and $b_1$ is the previous b. The results of this chapter reduce to those of Chapter 11 when p = 1.

The coefficients $b_i$ are called partial regression coefficients. They are estimators of the true regression coefficients $\beta_i$ which are supposed to characterize the population, and they are calculated from a set of observations of each of the p + 1 variates made on N individuals from the population. We shall denote the observed value of $X_i$ for the individual numbered $\alpha$ by $x_{i\alpha}$ ($i = 1, 2, \ldots, p$; $\alpha = 1, 2, \ldots, N$). The set of all N(p + 1) observations may be written as a matrix

$\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1N} \\ x_{21} & x_{22} & \cdots & x_{2N} \\ \vdots & \vdots & & \vdots \\ x_{p1} & x_{p2} & \cdots & x_{pN} \\ y_1 & y_2 & \cdots & y_N \end{pmatrix}$

with p + 1 rows and N columns, and much of the material in this chapter is most conveniently expressed in the notation of matrix algebra. For those who are unfamiliar with this subject, a brief discussion of the principal ideas will be found in the Appendix, §§ A.18-23.
The true regression equation in the population is supposed to be

(12.1.2)  $\eta = \sum_i \beta_i x_i, \quad i = 0, 1, 2, \ldots, p$

The $x_i$ are fixed numbers, or at least the errors in $x_i$ are small compared with the error in y. The $b_i$ will therefore be chosen to minimize the sum of squares of the differences between the observed $y_\alpha$ and the theoretical $\eta_\alpha$. We shall use the symbol $\sum$ to denote summation over variates with respect to i (sometimes j or k) and S to denote summation over individuals with respect to $\alpha$ (sometimes $\beta$ or $\gamma$). The least-squares condition becomes

(12.1.3)  $S_\alpha \left( y_\alpha - \sum_i \beta_i x_{i\alpha} \right)^2 = \text{minimum}$

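On differentiating with respect to the $\beta_i$ and equating the derivatives to zero,
we have for the estimators $\hat\beta_i$ the equations

          $\mathrm{S}_\alpha\, x_{i\alpha} \left( y_\alpha - \sum_j \hat\beta_j x_{j\alpha} \right) = 0, \qquad i = 0, 1, 2, \ldots, p$

or, equivalently,

(12.1.4)  $\mathrm{S}_\alpha \sum_{j=0}^{p} x_{i\alpha} x_{j\alpha} \hat\beta_j = \mathrm{S}_\alpha\, x_{i\alpha} y_\alpha, \qquad i = 0, 1, 2, \ldots, p$

This is a system of p + 1 linear equations in the p + 1 unknowns $\hat\beta_j$. Written
out in full, with $\hat\beta_j = b_j$, they are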


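(12.1.5)
$$
\begin{aligned}
b_0 a_{00} + b_1 a_{01} + \cdots + b_p a_{0p} &= g_0 \\
b_0 a_{10} + b_1 a_{11} + \cdots + b_p a_{1p} &= g_1 \\
b_0 a_{20} + b_1 a_{21} + \cdots + b_p a_{2p} &= g_2 \\
&\;\;\vdots \\
b_0 a_{p0} + b_1 a_{p1} + \cdots + b_p a_{pp} &= g_p
\end{aligned}
$$

where

(12.1.6)  $a_{ij} = \mathrm{S}_\alpha\, x_{i\alpha} x_{j\alpha}, \qquad g_i = \mathrm{S}_\alpha\, x_{i\alpha} y_\alpha$

This system is called the set of normal equations of the regression problem.
It is clear from the definition of $a_{ij}$ in (6) that $a_{ij} = a_{ji}$. The set of coefficients
$a_{ij}$ in the normal equations therefore forms a symmetric matrix. If we denote
this square symmetric matrix by A and let b and g denote the one-column
matrices (usually called column vectors)

$$
b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{bmatrix}, \qquad
g = \begin{bmatrix} g_0 \\ g_1 \\ \vdots \\ g_p \end{bmatrix}
$$

the equations of (5) may be written in the compact matrix form

(12.1.7)  $Ab = g$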
The matrix solution of these equations is

(12.1.8)  $b = A^{-1} g$

where $A^{-1}$ is the inverse of A, that is, the matrix which when multiplied by A
becomes the unit matrix. One method of inverting a matrix is given in Appendix
A.23. There are other, and perhaps speedier, methods, but this one is
straightforward and systematic.
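As an illustration of Eqs. (12.1.6) to (12.1.8), the following sketch forms A and g from a small data matrix and solves the normal equations. It assumes NumPy is available; the four data points are hypothetical and lie exactly on y = 1 + 2x, so that the expected coefficients are known in advance.

```python
import numpy as np

# Hypothetical data lying exactly on y = 1 + 2x (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Rows of X are the variates: the dummy X0 = 1 and the predictor X1 = x,
# so X has p + 1 rows and N columns as in the text.
X = np.vstack([np.ones_like(x), x])

A = X @ X.T                           # a_ij = S x_i x_j   (12.1.6)
g = X @ y                             # g_i  = S x_i y     (12.1.6)

b = np.linalg.solve(A, g)             # solves Ab = g      (12.1.7)
b_via_inverse = np.linalg.inv(A) @ g  # b = A^{-1} g       (12.1.8)
```

Both routes give the same coefficients here. In numerical work a linear solver is usually preferred to forming the inverse explicitly, although the inverse is needed later for the variances of the coefficients.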

12.2  The Regression Equations and Maximum Likelihood    If $y_\alpha = \eta_\alpha + \varepsilon_\alpha$,
and if the $\varepsilon_\alpha$ may be assumed to be independently and normally distributed about
zero with a common variance $\sigma^2$, the joint probability density for the set of $\varepsilon$'s
is $L = \sigma^{-N} (2\pi)^{-N/2} \exp[-\mathrm{S}(\varepsilon_\alpha^2)/2\sigma^2]$. Therefore,

(12.2.1)  $\log L = -N \log \sigma - \frac{N}{2} \log(2\pi) - \frac{\mathrm{S}(\varepsilon_\alpha^2)}{2\sigma^2}$

The condition for maximum L is clearly the same as for minimum $\mathrm{S}(\varepsilon_\alpha^2)$, which
is equivalent to Eq. (12.1.3). The method of maximum likelihood leads
therefore to the same normal equations as the method of least squares.

We may also consider the b's as linear functions of the $y_\alpha$, chosen so as to
be the best unbiased estimators of the $\beta$'s. If by "best" we mean having minimum
variance (and therefore maximum precision), it may be shown [1] that by using
this criterion we again arrive at the same set of normal equations.

12.3  The Solution of the Normal Equations    The normal equations are

(12.3.1)  $\sum_j a_{ij} b_j = g_i, \qquad i = 0, 1, 2, \ldots, p$

Besides the matrix solution of Eq. (12.1.8), there is an elegant theoretical
solution provided by Cramer's rule, namely,

          $b_j = \dfrac{d_j(A)}{d(A)}$

where d(A) is the determinant of the matrix A (assumed to be non-singular)
and $d_j(A)$ is the determinant of the matrix derived from A by replacing its jth
column by the column of g's. However, for p > 3, it is generally safer to use
a process of systematic elimination of the unknowns, one at a time. One such
process is the Square Root Method, often called Choleski's method ([2], [3]).
Details of the process, with an illustrative example, are given in Appendix A.24.
There are several other methods available, but this is one of the more compact
schemes.
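The square root method can be sketched in a few lines. The version below is a minimal textbook form of Choleski factorization, A = LL' with L lower triangular, followed by two triangular solves; it is not necessarily laid out in the same way as the scheme of Appendix A.24, and the 2-by-2 system used to exercise it is hypothetical.

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L L' for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)   # the "square root" step
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_normal_equations(A, g):
    """Solve Ab = g by forward substitution (Lz = g), then back substitution (L'b = z)."""
    L = cholesky(A)
    n = len(g)
    z = [0.0] * n
    for i in range(n):
        z[i] = (g[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    b = [0.0] * n
    for i in reversed(range(n)):
        b[i] = (z[i] - sum(L[k][i] * b[k] for k in range(i + 1, n))) / L[i][i]
    return b

# Hypothetical symmetric system: 4 b0 + 2 b1 = 10, 2 b0 + 3 b1 = 8.
A = [[4.0, 2.0], [2.0, 3.0]]
g = [10.0, 8.0]
b = solve_normal_equations(A, g)
```

The unknowns are eliminated one at a time, as the text recommends, and the symmetry of A means only its lower triangle is ever read.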

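12.4  The Variances and Covariances of the Regression Coefficients    From
the assumptions mentioned in § 12.2, it follows that the expectation of $b_i$ is
equal to $\beta_i$ and that the covariance of $b_i$ and $b_j$ is $\sigma^2 a^{ij}$, where $a^{ij}$ is an element
of the inverse matrix $A^{-1}$.

Since $g_i = \mathrm{S}_\alpha\, x_{i\alpha} y_\alpha$ and $b_i = \sum_j a^{ij} g_j$, we have

(12.4.1)  $b_i = \sum_j a^{ij}\, \mathrm{S}_\alpha\, x_{j\alpha} y_\alpha$

Now the expectation of $y_\alpha$ is $\eta_\alpha = \sum_k \beta_k x_{k\alpha}$, so that

(12.4.2)  $E(b_i) = \sum_j a^{ij}\, \mathrm{S}_\alpha\, x_{j\alpha} \sum_k \beta_k x_{k\alpha}$
                  $= \sum_k \beta_k \sum_j a^{ij} a_{jk}$
                  $= \sum_k \beta_k \delta_{ik}$

where $\delta_{ik} = 1$ when i = k and 0 when i ≠ k. Therefore,

(12.4.3)  $E(b_i) = \beta_i$

Since $\varepsilon_\alpha$ and $\varepsilon_\beta$ are supposed to be independent, the covariance of $y_\alpha$ and $y_\beta$
is given by

(12.4.4)  $C(y_\alpha, y_\beta) = \delta_{\alpha\beta}\, \sigma^2$

The x's being fixed,

(12.4.5)  $C(g_i, g_j) = \mathrm{S}_\alpha \mathrm{S}_\beta\, x_{i\alpha} x_{j\beta}\, C(y_\alpha, y_\beta)$
                       $= \sigma^2\, \mathrm{S}_\alpha\, x_{i\alpha} x_{j\alpha}$
                       $= \sigma^2 a_{ij}$

since the only non-zero term in the sum over $\beta$ arises when $\beta = \alpha$. Then

(12.4.6)  $C(b_i, b_j) = \sum_k a^{ik} \sum_l a^{jl}\, C(g_k, g_l)$
                       $= \sigma^2 \sum_k \sum_l a^{ik} a^{jl} a_{kl}$
                       $= \sigma^2 a^{ij}$

The elements of the matrix $A^{-1}$, multiplied by $\sigma^2$, give therefore the variances
and covariances of the regression coefficients. The diagonal terms in particular
give the variances (and hence the estimated standard errors) of the regression
coefficients.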

The variance of the predicted value $y_c$, given by Eq. (12.1.1) for some new
set of values $x_1, x_2, \ldots, x_p$ of the predictors, is obtained from Eq. (6). In fact,

(12.4.7)  $V(y_c) = E\left[ \sum_i (b_i - \beta_i) x_i \right]^2 = \sum_{i,j} x_i x_j\, E[(b_i - \beta_i)(b_j - \beta_j)]$
                  $= \sum_{i,j} x_i x_j\, C(b_i, b_j)$
                  $= \sigma^2 \sum_{i,j} a^{ij} x_i x_j$

The variance of the observed y which would correspond to the new observed
set $x_1, \ldots, x_p$ is found by adding $\sigma^2$, which is the variance about the regression
plane (or hyperplane). It becomes

(12.4.8)  $V(y) = \sigma^2 \left[ 1 + \sum_{i,j} a^{ij} x_i x_j \right]$

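This is a generalization to p + 1 dimensions of the two-dimensional relation
of Eq. (11.5.16). For if $x_0 = 1$ and $x_1 = x$, the 2 by 2 matrix A is

$$
A = \begin{bmatrix} N & N\bar x \\ N\bar x & \mathrm{S}x^2 \end{bmatrix}
$$

and its determinant is $d(A) = N(N - 1)s_x^2$. The inverse matrix is

$$
A^{-1} = \frac{1}{(N - 1)s_x^2}
\begin{bmatrix} \dfrac{N - 1}{N}\, s_x^2 + \bar x^2 & -\bar x \\ -\bar x & 1 \end{bmatrix}
$$

Therefore,

$$
\sum_{i,j} a^{ij} x_i x_j = a^{00} + 2a^{01}x + a^{11}x^2
= \frac{\dfrac{N - 1}{N}\, s_x^2 + \bar x^2 - 2x\bar x + x^2}{(N - 1)s_x^2}
= \frac{1}{N} + \frac{(x - \bar x)^2}{(N - 1)s_x^2}
$$

as in § 11.5.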
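The reduction to the two-dimensional case can be checked numerically. The sketch below builds the 2-by-2 matrix A for a single predictor, inverts it by the usual cofactor formula, and compares the quadratic form of Eq. (12.4.7) with the closed form of § 11.5. The predictor values and the new point are hypothetical, chosen only to exercise the identity.

```python
# Hypothetical predictor values and a new point at which to predict.
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
x_new = 8.0

N = len(xs)
xbar = sum(xs) / N
sx2 = sum((x - xbar) ** 2 for x in xs) / (N - 1)    # sample variance s_x^2

# Elements of A = XX' for the rows x0 = 1 and x1 = x.
a00, a01, a11 = float(N), sum(xs), sum(x * x for x in xs)
det = a00 * a11 - a01 * a01                         # d(A)

# Elements a^{ij} of the inverse of the symmetric 2-by-2 matrix A.
i00, i01, i11 = a11 / det, -a01 / det, a00 / det

# Sum over i, j of a^{ij} x_i x_j with x_0 = 1 and x_1 = x_new.
quad_form = i00 + 2.0 * i01 * x_new + i11 * x_new ** 2

# The two-dimensional expression: 1/N + (x - xbar)^2 / ((N - 1) s_x^2).
closed_form = 1.0 / N + (x_new - xbar) ** 2 / ((N - 1) * sx2)
```

Multiplying either quantity by $\sigma^2$ gives $V(y_c)$; adding $\sigma^2$ gives the variance of a new observation as in Eq. (12.4.8).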


12.5  Residuals  The difference between the observed value $y_\alpha$ and the
computed value $y_c$ for a given individual in the sample is called a residual, and
is usually denoted by $v_\alpha$.

(12.5.1)  $v_\alpha = y_\alpha - y_c = y_\alpha - \sum_i b_i x_{i\alpha}$

This is not the same as the true error $\delta_\alpha$, which may be defined by

(12.5.2)  $\delta_\alpha = y_\alpha - \eta_\alpha = y_\alpha - \sum_i \beta_i x_{i\alpha}$

If v is the column vector of the $v_\alpha$ ($\alpha = 1, 2, \ldots, N$) and X is the (p + 1)-by-N
matrix $(x_{i\alpha})$, then

          $Xv = X(y - X'b) = g - Ab$

where X' is the transpose of X (with rows and columns interchanged). This
follows from the definitions of $a_{ij}$ and $g_i$ in (12.1.6), which in matrix notation
become XX' = A and Xy = g. But by (12.1.7), g = Ab, and therefore

(12.5.3)  $Xv = 0$

This is equivalent to the set of equations

(12.5.4)  $\mathrm{S}_\alpha\, x_{i\alpha} v_\alpha = 0, \qquad i = 0, 1, 2, \ldots, p$
The residuals are therefore said to be orthogonal to each of the predictors
$x_1, x_2, \ldots, x_p$.

The sum of squares of the residuals, which is the minimum sum of squares
in (12.1.3), may be written

          $\mathrm{S}_\alpha v_\alpha^2 = v'v = (y - X'b)'v$
                     $= (y' - b'X)v = y'v$

since Xv = 0. Therefore,

(12.5.5)  $\mathrm{S}_\alpha v_\alpha^2 = y'(y - X'b) = y'y - g'b$

since g' = y'X'. In scalar notation this is

(12.5.6)  $\mathrm{S}_\alpha v_\alpha^2 = \mathrm{S}_\alpha y_\alpha^2 - \sum_i b_i g_i$

This equation gives another method of computing the sum of squares of residuals.
It should be noted that since the two terms on the right-hand side are often
nearly equal in magnitude, the values of $b_i$ used should be correct to several
more significant figures than are required in the final sum of squares.

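12.6  Distribution of the Sum of Squares of Residuals  Since $y_\alpha = \eta_\alpha + \delta_\alpha$,
and since the $\delta_\alpha$ are supposed to be distributed with expectation zero and
variance $\sigma^2$, we have

          $E(y_\alpha^2) = E(\eta_\alpha^2 + 2\eta_\alpha \delta_\alpha + \delta_\alpha^2)$
                         $= \eta_\alpha^2 + \sigma^2$

Now $\eta_\alpha = \sum_i \beta_i x_{i\alpha}$, so that

          $\eta_\alpha^2 = \sum_{i,j} \beta_i \beta_j x_{i\alpha} x_{j\alpha}$

Therefore,

(12.6.1)  $E(\mathrm{S}_\alpha y_\alpha^2) = \sum_{i,j} \beta_i \beta_j a_{ij} + N\sigma^2$

Also, $\sum_i b_i g_i = \sum_{i,j} b_i a_{ij} b_j$, so that

(12.6.2)  $E\left( \sum_i b_i g_i \right) = \sum_{i,j} a_{ij}\, E(b_i b_j)$

By Eqs. (12.4.3) and (12.4.6), $E(b_i b_j) = \beta_i \beta_j + \sigma^2 a^{ij}$ and therefore,

(12.6.3)  $E\left( \sum_i b_i g_i \right) = \sum_{i,j} (\beta_i \beta_j + \sigma^2 a^{ij})\, a_{ij}$
                                         $= \sum_{i,j} \beta_i \beta_j a_{ij} + \sigma^2 (p + 1)$

since $\sum_{i,j} a^{ij} a_{ij} = \sum_j \delta_{jj} = p + 1$, $\delta_{jj}$ being 1 for each of its p + 1 values.
Substituting Eqs. (1) and (3) in Eq. (12.5.6), we obtain

(12.6.4)  $E\left[ \mathrm{S}_\alpha v_\alpha^2 \right] = \sigma^2 (N - p - 1)$

which means that an unbiased estimator of $\sigma^2$ is furnished by

(12.6.5)  $\hat\sigma^2 = \dfrac{1}{N - p - 1}\, \mathrm{S}_\alpha v_\alpha^2$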

It may be proved (e.g., [4]) that if the $\delta_\alpha$ are assumed also to be normal,
then $(N - p - 1)\hat\sigma^2/\sigma^2$ has the $\chi^2$ distribution with N − p − 1 degrees of
freedom. Moreover, on this assumption the $b_i$ are normally distributed and are
independent of $\hat\sigma^2$. This means that the Student t distribution can be used to
fix confidence intervals for the $\beta_i$. In fact,

(12.6.6)  $t = \dfrac{b_i - \beta_i}{\hat\sigma\, (a^{ii})^{1/2}}$

with N − p − 1 degrees of freedom, so that if $t_\alpha$ corresponds to a confidence
coefficient of 100(1 − α)%,

(12.6.7)  $b_i - t_\alpha \hat\sigma (a^{ii})^{1/2} < \beta_i < b_i + t_\alpha \hat\sigma (a^{ii})^{1/2}$

The variance of the difference of two coefficients $b_i$ and $b_j$ is given by

(12.6.8)  $V(b_i - b_j) = V(b_i) + V(b_j) - 2C(b_i, b_j)$
                        $= \sigma^2 (a^{ii} + a^{jj} - 2a^{ij})$

and this may be used to test whether two coefficients differ significantly.

Example 1  The following artificial data are supposed to represent the
yield y of a chemical reaction under different conditions of (a) time of reaction,
(b) temperature, (c) amount of an added ingredient. Each variate $x_i$ (i = 1, 2, 3)
takes only two values, which we may code as −1 and 1. The variate $x_0$ is a
dummy which always has the value 1. The matrix X therefore has the form

$$
X = \begin{bmatrix}
 1 &  1 &  1 &  1 &  1 &  1 &  1 &  1 \\
-1 &  1 & -1 &  1 & -1 &  1 & -1 &  1 \\
-1 & -1 &  1 &  1 & -1 & -1 &  1 &  1 \\
-1 & -1 & -1 & -1 &  1 &  1 &  1 &  1
\end{bmatrix}
$$

and y' (the row vector of observations) is y' = [61, 83, 51, 70, 66, 92, 56, 83].
Then

$$
XX' = \begin{bmatrix} 8 & 0 & 0 & 0 \\ 0 & 8 & 0 & 0 \\ 0 & 0 & 8 & 0 \\ 0 & 0 & 0 & 8 \end{bmatrix},
\qquad
g = Xy = \begin{bmatrix} 562 \\ 94 \\ -42 \\ 32 \end{bmatrix}
$$

Therefore,

$$
A^{-1} = \begin{bmatrix} \tfrac18 & 0 & 0 & 0 \\ 0 & \tfrac18 & 0 & 0 \\ 0 & 0 & \tfrac18 & 0 \\ 0 & 0 & 0 & \tfrac18 \end{bmatrix}
\qquad \text{and} \qquad
b = A^{-1} g = \begin{bmatrix} 70.25 \\ 11.75 \\ -5.25 \\ 4.00 \end{bmatrix}
$$

The fitted equation is

          $y = 70.25 + 11.75x_1 - 5.25x_2 + 4.00x_3$

The residuals are given by

          v' = [1.25, −0.25, 1.75, −2.75, −1.75, 0.75, −1.25, 2.25]

and $\mathrm{S}v_\alpha^2 = 360/16 = 22.5$, which agrees with $y'y - g'b = 40956 - 40933.5$
from Eq. (12.5.5).

This sum of squares of residuals represents both experimental error and the
inadequacy of the linear model. Unless the experimental error can be
independently estimated (for example, by replicated observations) there is no good
way of telling whether the linear model is satisfactory.

The estimator of $\sigma^2$ in this example is

          $\hat\sigma^2 = \dfrac{22.5}{8 - 4} = 5.625$

For four degrees of freedom, $t_\alpha$ corresponding to a confidence coefficient of
95% is 2.78, and for each value of i, $a^{ii} = 1/8$. Therefore the confidence interval
for each $\beta_i$ is $b_i \pm 2.78\,(0.70)^{1/2} = b_i \pm 2.33$.
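The arithmetic of Example 1 can be reproduced directly from the matrix formulas of §§ 12.1 to 12.6. The sketch below assumes NumPy is available; the value 2.78 for $t_\alpha$ is simply copied from the text rather than computed, and all variable names are illustrative.

```python
import numpy as np

# Design of Example 1: the dummy x0 and three predictors coded -1 / +1.
X = np.array([
    [ 1,  1,  1,  1,  1,  1,  1,  1],
    [-1,  1, -1,  1, -1,  1, -1,  1],
    [-1, -1,  1,  1, -1, -1,  1,  1],
    [-1, -1, -1, -1,  1,  1,  1,  1],
], dtype=float)
y = np.array([61, 83, 51, 70, 66, 92, 56, 83], dtype=float)

A = X @ X.T                      # XX' = A
g = X @ y                        # Xy = g
A_inv = np.linalg.inv(A)
b = A_inv @ g                    # (12.1.8)

v = y - X.T @ b                  # residuals, (12.5.1)
rss = float(v @ v)               # sum of squares of residuals
rss_alt = float(y @ y - g @ b)   # the same quantity via (12.5.5)

N, n_coef = y.size, X.shape[0]   # n_coef = p + 1
sigma2_hat = rss / (N - n_coef)  # (12.6.5)

t_95 = 2.78                      # taken from the text (4 d.f., 95%)
half_width = t_95 * np.sqrt(sigma2_hat * np.diag(A_inv))   # (12.6.7)
```

Run as written, the residual for the last observation comes out as 2.25, the residuals sum to zero as Eq. (12.5.4) with i = 0 requires, and both routes to the sum of squares give 22.5.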

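12.7  Fitting a Polynomial of Second or Higher Degree  Since the predictors
$X_1, \ldots, X_p$ of § 12.1 are not assumed to be independent, they may be taken as
powers of a single variate X, say $X, X^2, \ldots, X^p$, and the method of least squares
may then be used to fit a polynomial of degree p to a set of N observations
of pairs of values $(x_\alpha, y_\alpha)$. The values of $x_\alpha$ may be chosen arbitrarily and the
computations will be simplified if they can be taken as equally spaced along the
x-axis. Instead of Eq. (12.1.6) we now have

(12.7.1)  $a_{ij} = \mathrm{S}\, x_\alpha^{\,i+j}, \qquad g_i = \mathrm{S}\, x_\alpha^{\,i}\, y_\alpha, \qquad i, j = 0, 1, \ldots, p$

Thus if we wish to fit the quadratic

(12.7.2)  $y_c = b_0 + b_1 x + b_2 x^2$

to a set of N pairs $(x_\alpha, y_\alpha)$, the equation giving the $b_i$ is

          $Ab = g$

or, written out,

(12.7.3)
$$
\begin{bmatrix}
N & \mathrm{S}x_\alpha & \mathrm{S}x_\alpha^2 \\
\mathrm{S}x_\alpha & \mathrm{S}x_\alpha^2 & \mathrm{S}x_\alpha^3 \\
\mathrm{S}x_\alpha^2 & \mathrm{S}x_\alpha^3 & \mathrm{S}x_\alpha^4
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}
=
\begin{bmatrix} \mathrm{S}y_\alpha \\ \mathrm{S}x_\alpha y_\alpha \\ \mathrm{S}x_\alpha^2 y_\alpha \end{bmatrix}
$$

If we choose the unit of x so that the values (assumed equally spaced)
increase by 1 from one observation to the next, and if we take the origin of x
midway between the first and last observations, we shall have $\mathrm{S}x_\alpha = \mathrm{S}x_\alpha^3 = 0$,
and this will considerably shorten the calculations. The equations of (3) then
become:

(12.7.4)
$$
\begin{aligned}
N b_0 + \mathrm{S}x_\alpha^2 \cdot b_2 &= \mathrm{S}y_\alpha \\
\mathrm{S}x_\alpha^2 \cdot b_1 &= \mathrm{S}x_\alpha y_\alpha \\
\mathrm{S}x_\alpha^2 \cdot b_0 + \mathrm{S}x_\alpha^4 \cdot b_2 &= \mathrm{S}x_\alpha^2 y_\alpha
\end{aligned}
$$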

Example 2    Suppose corresponding values of $x_\alpha$ and $y_\alpha$ are as given in the
following table:

  x_α      5     15     25     35     45     55     65     75     85     95
  y_α   10.0    8.1    9.3   12.1   13.6   17.5   20.0   24.0   30.0   42.5
  u_α   -4.5   -3.5   -2.5   -1.5   -0.5    0.5    1.5    2.5    3.5    4.5

If we replace x by u = (x − 50)/10, the conditions $\mathrm{S}u_\alpha = \mathrm{S}u_\alpha^3 = 0$ will be
satisfied. Also $\mathrm{S}u_\alpha^2 = 82.5$, $\mathrm{S}u_\alpha^4 = 1208.6$, $\mathrm{S}y_\alpha = 187.1$,
$\mathrm{S}u_\alpha y_\alpha = 273.45$, and $\mathrm{S}u_\alpha^2 y_\alpha = 1817.98$. The equations for $b_0$, $b_1$, $b_2$ (in
$y_c = b_0 + b_1 u + b_2 u^2$) are, therefore,

          $10 b_0 + 82.5 b_2 = 187.1$
          $82.5 b_1 = 273.45$
          $82.5 b_0 + 1208.6 b_2 = 1817.98$

from which $b_0 = 14.4225$, $b_1 = 3.3145$, and $b_2 = 0.5197$. In terms of x, the
best-fitting quadratic (or parabola) is

          $y_c = 14.4225 + 0.33145(x - 50) + 0.005197(x - 50)^2$
              $= 10.842 - 0.1882x + 0.005197x^2$

The goodness of fit may be estimated from the sum of squares of residuals,
which in this case amounts to 22.36. If a straight line were fitted by the same
least squares process the equation would be

          $y_c = 18.71 + 3.3145u$
              $= 2.138 + 0.33145x$

The sum of squares of residuals is 164.95, so that the fit of the parabola is
apparently considerably better than that of the straight line. The relation
between quadratic and linear regression may be brought out by an analysis
of variance, as in Table 12.1.


Table 12.1

  Variation                                                 S.S.    D.F.     M.S.
  Total [$\mathrm{S}(y_\alpha - \bar y)^2$]                           1071.33     9
  Linear regression [$\mathrm{S}(y_c - \bar y)^2$]                     906.38     1    906.38
  About linear regression [$\mathrm{S}v_\alpha^2$]                     164.95     8     20.62
  Quadratic regression [$\mathrm{S}(y_c - \bar y)^2$]                 1048.97     2    524.48
  About quadratic regression [$\mathrm{S}v_\alpha^2$]                   22.36     7      3.19

The variations "about regression" are the sums of squares of the residuals
for the two fitted lines. The variations "due to regression" are calculated by
difference from the total S.S., which is $\mathrm{S}y_\alpha^2 - \frac{1}{N}(\mathrm{S}y_\alpha)^2$. Since there are two
constants for the straight line and three for the parabola, calculated from the
data, the degrees of freedom for variation about regression are N − 2 and
N − 3 respectively.

The reduction in S.S., due to replacing the straight line by the parabola, is
164.95 − 22.36 = 142.59, with 1 d.f. This reduction may be compared with
the S.S. about the parabola (22.36 with 7 d.f.). The F-value is clearly highly
significant (F = 44.7 with 1 and 7 d.f.).
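The fits of Example 2 and the F-ratio comparing them can be reproduced from the normal equations with the predictors taken as powers of u. This is a sketch assuming NumPy; `fit_polynomial` is an illustrative helper, not a library routine.

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55, 65, 75, 85, 95], dtype=float)
y = np.array([10.0, 8.1, 9.3, 12.1, 13.6, 17.5, 20.0, 24.0, 30.0, 42.5])
u = (x - 50.0) / 10.0            # coded so that S u = S u^3 = 0

def fit_polynomial(u, y, degree):
    """Least-squares polynomial via the normal equations Ab = g of (12.7.1)."""
    X = np.vstack([u ** i for i in range(degree + 1)])   # rows 1, u, u^2, ...
    A = X @ X.T                  # a_ij = S u^(i+j)
    g = X @ y                    # g_i  = S u^i y
    b = np.linalg.solve(A, g)
    v = y - X.T @ b              # residuals
    return b, float(v @ v)

b_lin, rss_lin = fit_polynomial(u, y, 1)
b_quad, rss_quad = fit_polynomial(u, y, 2)

N = y.size
# Reduction in S.S. (1 d.f.) compared with the mean square about the parabola.
F = (rss_lin - rss_quad) / (rss_quad / (N - 3))
```

The results agree with the text to within its rounding; the last digit of a sum of squares obtained by difference can differ slightly when the coefficients are carried to more figures.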


*  12.8  Orthogonal Polynomials  The method of § 12.7 has the disadvantage
that if we want to improve the fit by using a higher degree polynomial than one
already fitted (a cubic instead of a quadratic, for example) the coefficients for
the new polynomial have to be calculated afresh from the beginning. The
method of orthogonal polynomials, suggested by R. A. Fisher, allows us to add
new terms independently of those already calculated. Incidentally, tests of
significance of the coefficients are simplified.

Two polynomials $P_1(x)$ and $P_2(x)$ are said to be orthogonal for the set of
values $x_\alpha$ ($\alpha = 1, 2, \ldots, N$) if

(12.8.1)  $\mathrm{S}_\alpha \left[ P_1(x_\alpha) \cdot P_2(x_\alpha) \right] = 0$

For example, the polynomials $P_1 = x - 4$, $P_2 = x^2 - 8x + 12$,
$P_3 = x^3 - 12x^2 + 41x - 36$, are orthogonal to each other and to $P_0 = 1$ for
x = 1, 2, 3, ..., 7, as is evident from the following table of values:


Table 12.2

    x     P0P1    P0P2    P0P3    P1P2    P1P3    P2P3
    1      -3       5      -6     -15      18     -30
    2      -2       0       6       0     -12       0
    3      -1      -3       6       3      -6     -18
    4       0      -4       0       0       0       0
    5       1      -3      -6      -3      -6      18
    6       2       0      -6       0     -12       0
    7       3       5       6      15      18      30
   Sum      0       0       0       0       0       0

It can be proved [5] that any polynomial of degree p can be expressed as a
linear function of p + 1 polynomials

(12.8.2)  $P(x) = A_0 \xi_0 + A_1 \xi_1 + \cdots + A_p \xi_p$

where $\xi_i$ is a polynomial in x of degree i. The equality holds for N distinct
values of x, denoted by $x_\alpha$, and the polynomials $\xi_i$ are all orthogonal to each
other. If x takes the values 1, 2, ..., N, the first few orthogonal polynomials are

(12.8.3)
$$
\begin{aligned}
\xi_0 &= 1 \\
\xi_1 &= \lambda_1 (x - \bar x) \\
\xi_2 &= \lambda_2 \left[ (x - \bar x)^2 - (N^2 - 1)/12 \right] \\
\xi_3 &= \lambda_3 \left[ (x - \bar x)^3 - (x - \bar x)(3N^2 - 7)/20 \right] \\
\xi_4 &= \lambda_4 \left[ (x - \bar x)^4 - (x - \bar x)^2 (3N^2 - 13)/14 + 3(N^2 - 1)(N^2 - 9)/560 \right]
\end{aligned}
$$

where $\bar x = (N + 1)/2$ and the $\lambda$'s are constants chosen so as to make the $\xi_i$
integers (as small as possible) for all the N values of x. Thus if N = 7 we have
$\bar x = 4$ and the constants are: $\lambda_1 = 1$, $\lambda_2 = 1$, $\lambda_3 = 1/6$, $\lambda_4 = 7/12$.
The sets of values of these polynomials for N = 7 are given in the following
table:

Table 12.3

    x     $\xi_1$    $\xi_2$    $\xi_3$    $\xi_4$
    1      -3       5      -1       3
    2      -2       0       1      -7
    3      -1      -3       1       1
    4       0      -4       0       6
    5       1      -3      -1       1
    6       2       0      -1      -7
    7       3       5       1       3

On comparing Tables 12.2 and 12.3 it is clear that $\xi_1$, $\xi_2$ and $\xi_3$ are the
same as the polynomials previously called $P_1$, $P_2$ and $P_3$, except that they are
now multiplied by the corresponding $\lambda$'s. All the polynomials with even
subscripts (like $\xi_2$ and $\xi_4$) have a set of values which is symmetric about the middle,
while all those with odd subscripts (like $\xi_1$ and $\xi_3$) are skew-symmetric. It is
therefore unnecessary in a table to record all the values, and usually the lower
half of the table (with the middle line when N is odd) is all that is actually
printed. For Table 12.3, this would be the values from x = 4 to x = 7.
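The tabulated values for N = 7 can be regenerated from the general expressions for $\xi_1$ to $\xi_4$ together with the constants $\lambda_1 = 1$, $\lambda_2 = 1$, $\lambda_3 = 1/6$, $\lambda_4 = 7/12$ quoted in the text. The sketch below uses exact rational arithmetic from the Python standard library so that integer values and zero cross-sums can be verified without rounding; the function and variable names are illustrative.

```python
from fractions import Fraction as Fr

def orthogonal_values(N, lambdas):
    """Columns xi_0 .. xi_4 evaluated at x = 1, 2, ..., N."""
    xbar = Fr(N + 1, 2)
    n2 = N * N
    cols = [[] for _ in range(5)]
    for x in range(1, N + 1):
        d = x - xbar                                   # deviation x - xbar
        cols[0].append(Fr(1))
        cols[1].append(lambdas[1] * d)
        cols[2].append(lambdas[2] * (d**2 - Fr(n2 - 1, 12)))
        cols[3].append(lambdas[3] * (d**3 - d * Fr(3 * n2 - 7, 20)))
        cols[4].append(lambdas[4] * (d**4 - d**2 * Fr(3 * n2 - 13, 14)
                                     + Fr(3 * (n2 - 1) * (n2 - 9), 560)))
    return cols

lambdas = {1: Fr(1), 2: Fr(1), 3: Fr(1, 6), 4: Fr(7, 12)}
xi = orthogonal_values(7, lambdas)

# Sum over x of xi_i * xi_j for every pair i < j; orthogonality means all are zero.
cross_sums = {
    (i, j): sum(a * b for a, b in zip(xi[i], xi[j]))
    for i in range(5) for j in range(i + 1, 5)
}
```

Each column comes out in whole numbers, and every cross-sum vanishes, including those with $\xi_0 = 1$.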

If a polynomial of degree p is to be fitted to a set of N observations $(x_\alpha, y_\alpha)$,
the method of least squares applied to the equation $y_c = P(x)$, where P(x) is
given by Eq. (2), leads to the set of normal equations:

(12.8.4)
$$
\begin{aligned}
N A_0 + \mathrm{S}(\xi_{1\alpha}) A_1 + \cdots + \mathrm{S}(\xi_{p\alpha}) A_p &= \mathrm{S}(y_\alpha) \\
\mathrm{S}(\xi_{1\alpha}) A_0 + \mathrm{S}(\xi_{1\alpha}^2) A_1 + \cdots + \mathrm{S}(\xi_{1\alpha} \xi_{p\alpha}) A_p &= \mathrm{S}(y_\alpha \xi_{1\alpha}) \\
&\;\;\vdots \\
\mathrm{S}(\xi_{p\alpha}) A_0 + \cdots + \mathrm{S}(\xi_{p\alpha}^2) A_p &= \mathrm{S}(y_\alpha \xi_{p\alpha})
\end{aligned}
$$

However, because of the orthogonal property of these polynomials (including
$\xi_0 = 1$), all the terms but one on the left-hand side vanish in each of these
equations. The set therefore reduces to

(12.8.5)  $A_i\, \mathrm{S}(\xi_{i\alpha}^2) = \mathrm{S}(y_\alpha \xi_{i\alpha}), \qquad i = 0, 1, \ldots, p$

from which the $A_i$ are immediately obtainable.
The sum of squares of the residuals is given by

(12.8.6)  $\mathrm{S}(v_\alpha^2) = \mathrm{S}(y_\alpha^2) - A_0 \mathrm{S}(y_\alpha) - A_1 \mathrm{S}(y_\alpha \xi_{1\alpha}) - \cdots - A_p \mathrm{S}(y_\alpha \xi_{p\alpha})$

Since $A_0 = \mathrm{S}(y_\alpha)/N = \bar y$, the first two terms of $\mathrm{S}(v_\alpha^2)$ give the total S.S.
about the mean. The third term gives the reduction due to linear regression,
the fourth term the additional reduction due to quadratic regression, and so on.

The numerical work of calculating the $A_i$ is greatly facilitated by tables
giving the values of $\xi_1$ to $\xi_5$ for different values of N. Such tables up to N = 75
may be found in Fisher and Yates' Statistical Tables (Oliver and Boyd). More
extensive tables up to N = 104 have been given by Anderson and Houseman [6].

Example 3  Suppose it is required to fit polynomials up to the fourth degree
to the data of Example 2, § 12.7. We replace x by u = (x + 5)/10, so that u
takes the values 1, 2, ..., 10. The values of $\xi_1$, $\xi_2$, $\xi_3$, $\xi_4$ for N = 10 are read
from the tables.

Table 12.4

    x      u      y      $\xi_1$     $\xi_2$     $\xi_3$     $\xi_4$
    5      1    10.0     -9       6     -42      18
   15      2     8.1     -7       2      14     -22
   25      3     9.3     -5      -1      35     -17
   35      4    12.1     -3      -3      31       3
   45      5    13.6     -1      -4      12      18
   55      6    17.5      1      -4     -12      18
   65      7    20.0      3      -3     -31       3
   75      8    24.0      5      -1     -35     -17
   85      9    30.0      7       2     -14     -22
   95     10    42.5      9       6      42      18

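We calculate $\mathrm{S}(y_\alpha) = 187.1$, $\mathrm{S}(y_\alpha^2) = 4571.97$, $\mathrm{S}(y_\alpha \xi_{1\alpha}) = 546.9$,
$\mathrm{S}(y_\alpha \xi_{2\alpha}) = 137.2$, $\mathrm{S}(y_\alpha \xi_{3\alpha}) = 252.2$, $\mathrm{S}(y_\alpha \xi_{4\alpha}) = 196.8$. The values of
$\mathrm{S}(\xi_1^2) = 330$, $\mathrm{S}(\xi_2^2) = 132$, $\mathrm{S}(\xi_3^2) = 8580$, $\mathrm{S}(\xi_4^2) = 2860$ are read from the
tables, as are also the values $\lambda_1 = 2$, $\lambda_2 = 1/2$, $\lambda_3 = 5/3$, $\lambda_4 = 5/12$. Then

  $A_0 = 187.1/10 = 18.71$,           $A_0 \mathrm{S}(y_\alpha) = 3500.64$
  $A_1 = 546.9/330 = 1.6573$,         $A_1 \mathrm{S}(y_\alpha \xi_{1\alpha}) = 906.38$
  $A_2 = 137.2/132 = 1.0394$,         $A_2 \mathrm{S}(y_\alpha \xi_{2\alpha}) = 142.61$
  $A_3 = 252.2/8580 = 0.029394$,      $A_3 \mathrm{S}(y_\alpha \xi_{3\alpha}) = 7.41$
  $A_4 = 196.8/2860 = 0.068811$,      $A_4 \mathrm{S}(y_\alpha \xi_{4\alpha}) = 13.54$

The polynomial is

(12.8.7)  $y_c = 18.71 + 1.6573\xi_1 + 1.0394\xi_2 + 0.029394\xi_3 + 0.068811\xi_4$

where, by Eq. (3) with $\bar u = 5.5$ and N = 10,

(12.8.8)
$$
\begin{aligned}
\xi_1 &= 2(u - 5.5) \\
\xi_2 &= \tfrac12 \left[ (u - 5.5)^2 - 8.25 \right] \\
\xi_3 &= \tfrac53 \left[ (u - 5.5)^3 - 14.65(u - 5.5) \right] \\
\xi_4 &= \tfrac{5}{12} \left[ (u - 5.5)^4 - 20.5(u - 5.5)^2 + 48.2625 \right]
\end{aligned}
$$

The best-fitting straight line is given by the first two terms only of Eq. (7), the
best-fitting parabola by the first three terms, and so on. On replacing u by
(x + 5)/10 we recover the results of § 12.7.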
Table 12.5

  Variation                                   S.S.    D.F.     M.S.
  Total                                     1071.33     9
  Linear regression                          906.38     1    906.38
  About linear regression                    164.95     8     20.62
  Additional for quadratic regression        142.61     1    142.61
  About quadratic regression                  22.34     7      3.19
  Additional for cubic regression              7.41     1      7.41
  About cubic regression                      14.93     6      2.49
  Additional for quartic regression           13.54     1     13.54
  About quartic regression                     1.39     5      0.28

Compared  with  the  deviation  about  the  cubic  regression  line,  the  additional 
sum  of  squares  for  cubic  regression  is  not  significant  (F  =  3.0  with  1  and  6  d.f.). 
However,  compared  with  the  deviation  about  quartic  regression,  the  additional 
S.S.  for  quartic  regression  is  highly  significant.  (F  =  48,  with  1  and  5  d.f.). 
This  indicates  that  the  cubic  curve  is  not  appreciably  better  than  the  parabola, 
but  the  quartic  curve  is  a  much  better  fit  than  both. 

Of  course,  by  using  a  ninth  degree  curve  we  could  fit  the  given  10  points 
exactly,  but  such  a  complicated  curve  is  obviously  not  desirable.  We  have  to 
compromise  between  the  desire  for  simplicity  and  the  desire  to  get  a  good  fit. 
The  second-degree  curve  (the  parabola)  is  probably  as  satisfactory  as  any 
polynomial  in  this  example. 
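The sequential analysis of Table 12.5 follows mechanically from Eqs. (12.8.5) and (12.8.6): each coefficient is a ratio of two sums, and each reduction in the sum of squares is obtained independently of the others. The sketch below uses the tabulated $\xi$ values for N = 10 and plain Python; the dictionary names are illustrative.

```python
y = [10.0, 8.1, 9.3, 12.1, 13.6, 17.5, 20.0, 24.0, 30.0, 42.5]

# Orthogonal polynomial values for N = 10, as tabulated in Table 12.4.
xi = {
    1: [-9, -7, -5, -3, -1, 1, 3, 5, 7, 9],
    2: [6, 2, -1, -3, -4, -4, -3, -1, 2, 6],
    3: [-42, 14, 35, 31, 12, -12, -31, -35, -14, 42],
    4: [18, -22, -17, 3, 18, 18, 3, -17, -22, 18],
}

N = len(y)
total_ss = sum(t * t for t in y) - sum(y) ** 2 / N   # S.S. about the mean

coeffs = {0: sum(y) / N}        # A_0 is the mean of y
reductions = {}                 # A_i * S(y xi_i): reduction due to degree i
residual_ss = {0: total_ss}     # S.S. about the fitted curve of each degree
f_ratios = {}                   # additional reduction / residual mean square

rss = total_ss
for i in sorted(xi):
    s_y_xi = sum(a * b for a, b in zip(y, xi[i]))
    s_xi2 = sum(t * t for t in xi[i])
    coeffs[i] = s_y_xi / s_xi2                 # (12.8.5)
    reductions[i] = coeffs[i] * s_y_xi         # one term of (12.8.6)
    rss -= reductions[i]
    residual_ss[i] = rss
    f_ratios[i] = reductions[i] / (rss / (N - i - 1))
```

Carried to full precision, the later entries differ from the printed table in the second decimal place, because the table subtracts already-rounded terms; the F-ratios for the cubic and quartic terms remain close to the 3.0 and 48 quoted in the text.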

The matrix which corresponds to A in Eq. (12.1.7) is now diagonal, so that
it is very easy to invert. In fact

$$
A = \begin{bmatrix}
N & 0 & \cdots & 0 \\
0 & \mathrm{S}(\xi_{1\alpha}^2) & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \mathrm{S}(\xi_{p\alpha}^2)
\end{bmatrix}
$$

Therefore $a^{ii} = [\mathrm{S}(\xi_{i\alpha}^2)]^{-1}$, and the estimated variance of the coefficient $A_i$ in
Eq. (2), i = 0, 1, 2, ..., p, is given by

(12.8.9)  $V(A_i) = a^{ii}\, \dfrac{\mathrm{S}v_\alpha^2}{N - p - 1}$

Thus in Eq. (7), the estimated standard error of $A_0$ is $[1.39/50]^{1/2} = 0.17$
and that of $A_1$ is $[1.39/(5 \times 330)]^{1/2} = 0.029$. These values apply, of course,
only if a fourth-degree curve is fitted, since the residuals $v_\alpha$ relate to such a
curve. The sum $\mathrm{S}v_\alpha^2$ is given by Eq. (6), with p = 4.

*  12.9  A Test for Linearity of Regression with Grouped Variates  When the
number of observations is sufficiently large to warrant grouping, we may be
presented with a two-way table of data like that in § 11.4, Example 2. Each
column (or x-array) includes all the observations with x-values lying in one
particular class-interval, and these are all assumed to have the same value,
namely that at the centre of the interval.

Within one column the y-values are also grouped in classes, and in each
class (i.e., in each row of the table) all the observations are assumed to have
the central value of y. We suppose that there are p columns in the table and
that the total frequency in the ith column is $f_i$, where $\sum_i f_i = N$.

We can then define for each column the arithmetic mean $\bar y_i$ of the y-values
in that column. If these column means are plotted against the central x-values
for the columns, the result of joining them is a sort of empirical trend line. In
fact, if a straight line is fitted by least squares to this set of column means, each
one being weighted with the corresponding column frequency, the result is
precisely the ordinary regression line of y on x.

The sum of squares of deviations from the column mean within one column
is $\mathrm{S}_\alpha (y_{i\alpha} - \bar y_i)^2$, where the $y_{i\alpha}$ are the y values in the ith column ($\alpha = 1, 2, \ldots, f_i$).
The ratio of this sum, added up for all columns, to the total sum of squares for y
about the over-all mean $\bar y$, defines a quantity called the correlation ratio of y
on x ($E_{yx}$) by the relation:

(12.9.1)  $1 - E_{yx}^2 = \dfrac{\sum_i \mathrm{S}_\alpha (y_{i\alpha} - \bar y_i)^2}{\sum_i \mathrm{S}_\alpha (y_{i\alpha} - \bar y)^2}$

The denominator is simply $(N - 1)s_y^2 = S_y$ (say), since the sum is over all y
values in the table. The above expression may be compared with one obtained
in Chapter 11 (see Eq. (11.5.2)), namely,

(12.9.2)  $1 - r^2 = \dfrac{\sum_i \mathrm{S}_\alpha (y_{i\alpha} - y_{ci})^2}{(N - 1)s_y^2}$

where $y_{ci}$ is the calculated value of y for the center of the ith column, according
to the linear regression equation of y on x. This indicates that $E_{yx}$ is similar
in nature to the Pearson coefficient r. If the regression is in fact nearly linear,
the two agree quite closely, but the more the regression (as indicated by the
line of column means) departs from a straight line the more do $E_{yx}$ and r differ.
The difference may be used to estimate the significance of an apparent departure
from linearity.
The quantity $E_{yx}^2$ may also be written

(12.9.3)  $E_{yx}^2 = \dfrac{S_{\bar y}}{S_y}$

where $S_{\bar y}$ is the weighted sum of squares of the column means about the over-all
mean and $S_y$ is the total sum of squares for y about the over-all mean. That is,

(12.9.4)  $S_{\bar y} = \sum_i f_i (\bar y_i - \bar y)^2, \qquad S_y = \sum_i \mathrm{S}_\alpha (y_{i\alpha} - \bar y)^2 = (N - 1)s_y^2$

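In the notation of § 11.4, with the auxiliary variables u and v,

(12.9.5)  $S_{\bar y} = k^2 \left[ \sum_i \dfrac{V_i^2}{f_i} - N \bar v^2 \right], \qquad S_y = (N - 1) k^2 s_v^2$

where $V_i$ is the sum of the v values in the ith column and $\bar v = \sum_i V_i / N$.
This gives the most convenient formula in practice for calculating $E_{yx}$.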
Analogously to Eq. (3) we can rewrite Eq. (2) in the form

(12.9.6)  $r^2 = \dfrac{S_{y_c}}{S_y}$

where $S_{y_c} = \sum_i f_i (y_{ci} - \bar y)^2$, which is the weighted sum of squares for the
calculated linear regression values $y_{ci}$. Therefore,

(12.9.7)  $(E_{yx}^2 - r^2) S_y = S_{\bar y} - S_{y_c}$
                              $= \sum_i f_i \left[ (\bar y_i - \bar y)^2 - (y_{ci} - \bar y)^2 \right]$

and so represents that part of the sum of squares for column means which is
not accounted for by linear regression. If this part is large compared with the
sum of squares within columns about the column means, $\sum_i \mathrm{S}_\alpha (y_{i\alpha} - \bar y_i)^2$
$= (1 - E_{yx}^2) S_y$, we may reasonably reject the hypothesis that the true
regression is linear. The test ratio is therefore $(E_{yx}^2 - r^2)/(1 - E_{yx}^2)$.

If the values of y within each column are normally distributed with a variance
$\sigma^2$ common to all the columns, then $\mathrm{S}_\alpha (y_{i\alpha} - \bar y_i)^2$ for any column is distributed
as $\chi^2 \sigma^2$ with $f_i - 1$ degrees of freedom. Therefore $(1 - E_{yx}^2) S_y$ is distributed as
$\chi^2 \sigma^2$ with $\sum_i (f_i - 1) = N - p$ degrees of freedom.

Furthermore, the p values of $\bar y_i$ are each normal with variance $\sigma^2 / f_i$, so that
$\sum_i f_i (\bar y_i - \bar y)^2$ is distributed as $\chi^2 \sigma^2$ with p − 1 d.f. Also

          $S_{y_c} = \sum_i f_i (y_{ci} - \bar y)^2 = b^2 (N - 1) s_x^2$

Now, as shown in § 11.5, b is normal with variance $\sigma^2 / [(N - 1) s_x^2]$, so that
$b^2 (N - 1) s_x^2$ is distributed as $\chi^2 \sigma^2$ with 1 d.f. It follows from Eq. (7) and
Theorem 4.3 that $(E_{yx}^2 - r^2) S_y$ is distributed as $\chi^2 \sigma^2$ with p − 2 d.f., and is
independent of $S_{y_c}$. The ratio

(12.9.8)  $F = \dfrac{N - p}{p - 2} \cdot \dfrac{E_{yx}^2 - r^2}{1 - E_{yx}^2}$

is therefore distributed as Snedecor's F with p − 2 and N − p degrees of
freedom. A significant value of F indicates a significant departure from linearity.
The test is one-tailed.

Example 4    The following table represents some results on the relation
between the percentage protein in wheat (y) and the yield in bushels per acre (x)
for a set of 91 experimental plots. We suppose that it is desired to predict y for
a given x, so that the x values may be regarded as fixed.

Table 12.6

  v \ u        -3      -2      -1       0       1       2       3       4     f_v   v f_v  v^2 f_v
     5                  1                                                       1       5      25
     4                                                                          0       0       0
     3          3       1                                                       4      12      36
     2          2       2                                                       4       8      16
     1          3       4       1       1                                       9       9       9
     0          2       4      15       2       1                              24       0       0
    -1                          4       3       1       2       2              12     -12      12
    -2                                  5       3       3       1              12     -24      48
    -3                          2       1       1       7       7       2      20     -60     180
    -4                                  1       1       2       1               5     -20      80

  f_u          10      12      22      13       7      14      11       2      91     -82     406
  u f_u       -30     -24     -22       0       7      28      33       8       0
  u^2 f_u      90      48      22       0       7      56      99      32     354
  V            16      16      -9     -19     -14     -37     -29      -6     -82
  uV          -48     -32       9       0     -14     -74     -87     -24    -270
  V/f_u      1.60    1.33   -0.41   -1.46   -2.00   -2.64   -2.64   -3.00
  V^2/f_u   25.60   21.33    3.68   27.77   28.00   97.79   76.45   18.00  298.62

The coded u and v values are given by

$u = \dfrac{x - 22}{5}, \qquad v = y - 13.45$

From the table and Eq. (5) we obtain

$S_{\bar y} = 298.62 - 91\left(\dfrac{-82}{91}\right)^2 = 224.73$
$S_y = 406 - 91\left(\dfrac{-82}{91}\right)^2 = 332.11$

whence

$E_{yx}^2 = 0.677$

Also the sum of squares for x and the sum of products for x and y are given by

$S_x = 25(354 - 0) = 8850$
$S_{xy} = 5(-270 - 0) = -1350$

so that

$r^2 = \dfrac{(-1350)^2}{(8850)(332.11)} = 0.620$

Therefore

$F = \dfrac{N - p}{p - 2}\cdot\dfrac{E_{yx}^2 - r^2}{1 - E_{yx}^2} = \dfrac{83}{6}\cdot\dfrac{0.057}{0.323} = 2.44$

with  6  and  83  d.f.  The  5%  point  is  about  2.21  so  that  there  is  a  significant 
departure  from  linearity. 
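
The arithmetic of Example 4 can be retraced in a few lines. The sketch below is only an illustration: the function name `linearity_f` and its argument names are assumptions introduced here, and the inputs are the marginal totals of Table 12.6.

```python
def linearity_f(n, p, s_between, s_total, s_x, s_xy):
    """Return (E^2, r^2, F) for the test of linearity, Eq. (12.9.8)."""
    e2 = s_between / s_total           # squared correlation ratio
    r2 = s_xy ** 2 / (s_x * s_total)   # squared Pearson correlation
    f = (n - p) / (p - 2) * (e2 - r2) / (1.0 - e2)
    return e2, r2, f


# Totals taken from Table 12.6 (coded units): N = 91 plots, p = 8 columns.
N_PLOTS, N_COLS = 91, 8
correction = N_PLOTS * (-82 / N_PLOTS) ** 2
s_between = 298.62 - correction        # S.S. between column means
s_total = 406 - correction             # total S.S. of y
e2, r2, f_ratio = linearity_f(N_PLOTS, N_COLS, s_between, s_total,
                              s_x=8850, s_xy=-1350)
print(f"E^2 = {e2:.3f}, r^2 = {r2:.3f}, F = {f_ratio:.2f}")
```

Carrying full precision through the calculation gives an F of roughly 2.42 rather than the 2.44 obtained from three-figure intermediate values; either way it exceeds the quoted 5% point of about 2.21.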

The  original  data  (ungrouped)  are  given  in  Snedecor's  Statistical  Methods 
(4th ed., p. 380), where a parabola is fitted by the method of § 12.7. The
difference between the S.S. about the parabola and the S.S. about the best straight
line  is  significant  as  compared  with  the  former  S.S.  itself.  This  confirms  the 
departure  from  linearity.  The  curve  of  column  means  (in  units  of  v)  is  plotted 
in  Fig.  53,  the  necessary  data  being  obtained  from  the  last  row  but  one  of 
Table  12.6.   For  comparison  the  best-fitting  parabola  is  also  shown. 

* 12.10 The Distribution of the Correlation Ratio  The correlation ratio $E_{yx}$
may be used as a measure of the degree to which the observations (as grouped)
tend to cluster around the curve of column means, just as Pearson's r measures
the degree of clustering around the straight regression line of Y on X. A similar,
but usually different, ratio, $E_{xy}$, measures the clustering around the curve of
row means. It may be calculated in the same way as $E_{yx}$, with x and y
interchanged throughout.

On the hypotheses mentioned in the previous section, and on the assumption
that there is really no association between the variates in the parent population,
the distribution of $E_{yx}^2$ was worked out by Hotelling [7]. He showed that $E_{yx}^2$
is a beta-variate with parameters $n_1 = p - 1$ and $n_2 = N - p$. It follows
therefore that $n_2E_{yx}^2/[n_1(1 - E_{yx}^2)]$ has the F-distribution with $n_1$ and $n_2$ d.f.

Fig. 53  Curve of column means and fitted parabola,
for data on wheat yield and protein content

The significance of an observed $E_{yx}$ may be tested by means of Pearson's
Tables of the Incomplete Beta Function or ordinary tables of F. A special
table for large values of N, 50(1)1000, was prepared by Woo [8].

If the population correlation ratio $\eta_{yx}$ is not zero, but if we assume that in
all samples there are the same set of frequencies $f_i$, the density function for $E^2$ is

(12.10.1)  $f(E^2) = e^{-\lambda}(E^2)^{a-1}(1 - E^2)^{b-1}\,\dfrac{H(\lambda E^2)}{B(a, b)}$

where $E^2$ is written for $E_{yx}^2$, $a = n_1/2$, $b = n_2/2$, and
$\lambda = N\eta^2/[2(1 - \eta^2)]$. The function H, with argument $\lambda E^2$, is called the confluent
hypergeometric function, defined by the series:

(12.10.2)  $H(x) = 1 + \dfrac{a + b}{a}\,x + \dfrac{(a + b)(a + b + 1)}{a(a + 1)}\,\dfrac{x^2}{2!} + \cdots$
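
As an illustration of Eqs. (12.10.1) and (12.10.2), the series for H can be summed term by term and the density checked numerically. This is a minimal sketch under stated assumptions: the function names `confluent_h` and `e2_density`, the truncation at 200 terms, and the trial values $n_1 = 7$, $n_2 = 83$, $\lambda = 2$ are choices made here, not part of the text.

```python
import math


def confluent_h(x, a, b, terms=200):
    """Partial sum of H(x) = sum_j [(a+b)_j / (a)_j] x^j / j!."""
    total, term = 0.0, 1.0
    for j in range(terms):
        total += term
        # Ratio of successive terms of the series.
        term *= (a + b + j) / (a + j) * x / (j + 1)
    return total


def e2_density(e2, n1, n2, lam):
    """Density of E^2 for 0 < e2 < 1, with a = n1/2 and b = n2/2."""
    a, b = n1 / 2.0, n2 / 2.0
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_core = ((a - 1) * math.log(e2) + (b - 1) * math.log(1 - e2)
                - log_beta - lam)
    return math.exp(log_core) * confluent_h(lam * e2, a, b)


# Midpoint-rule check that the density integrates to about 1.
STEPS = 4000
area = sum(e2_density((i + 0.5) / STEPS, 7, 83, 2.0)
           for i in range(STEPS)) / STEPS
print(round(area, 4))
```

Setting `lam = 0` makes every term after the first vanish, so the function reduces to the ordinary beta density of the null case.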

Tang [9] has tabulated the distribution function for $E^2$, namely, the
probability that $E^2 < E_\alpha^2$ for certain values of $\lambda$, $E_\alpha^2$ being fixed by the condition:

(12.10.3)  $\int_{E_\alpha^2}^{1} f(E^2 \mid \lambda = 0)\,dE^2 = \alpha$

for $\alpha = 0.01$ or 0.05. The tabulated probability is therefore that of an error
of the second kind (§ 6.6), the chance of an error of the first kind, namely, a
wrong rejection of the null hypothesis that $\lambda = 0$, being fixed at the value $\alpha$.

It may be noted that $E^2$ has the same distribution as x in § 9.12, where x is
the ratio of the S.S. between treatments to the total S.S. in a one-way
analysis-of-variance problem. The difference is that the number of treatments is replaced
by the number of columns and the S.S. between treatments by the S.S. between
column means. The null hypothesis of no treatment effects becomes the
hypothesis that in the population the column means are all equal. Under this
hypothesis, and if the variance of Y is the same within each column, $S_{\bar y}/\sigma^2$ has the
$\chi^2$ distribution with $n_1$ (= p - 1) d.f. Under the alternative hypothesis (that $\eta$,
and therefore $\lambda$, is not zero) it has the non-central chi-square distribution
(Appendix A.13) with non-centrality parameter

(12.10.4)  $\dfrac{N\eta^2}{2(1 - \eta^2)} = \dfrac{N(\sigma_Y^2 - \sigma^2)}{2\sigma^2}$

where $\sigma_Y^2$ is the variance of y in the population and $\sigma^2$ is the variance in each
column about the column mean. When $\lambda = 0$ this distribution becomes the
ordinary (central) chi-square distribution.

* 12.11 Exponential Regression  It is not uncommon for a variable Y to increase
or decrease with time at an approximately uniform percentage rate. This holds,
for example, for money accumulating at a fixed rate of compound interest or
for a bacterial population growing in an ample supply of culture medium. In
fact this type of increase is often referred to as "the law of growth." It may be
expressed mathematically by the relation:

(12.11.1)  $\dfrac{dY}{Y} = k\,dX$

or, in integral form,

(12.11.2)  $Y = Ae^{kX}$

If we wish to fit such an exponential curve by least squares to a set of N
pairs of observed values $(x_\alpha, y_\alpha)$, $\alpha = 1, 2, \ldots, N$, we have to calculate A and k
from the relation:

(12.11.3)  $S(y_\alpha - Ae^{kx_\alpha})^2 = \text{minimum}$

from which we get, by differentiating,

(12.11.4)
$S\,e^{kx_\alpha}(Ae^{kx_\alpha} - y_\alpha) = 0$
$S\,x_\alpha e^{kx_\alpha}(Ae^{kx_\alpha} - y_\alpha) = 0$


The exact solution of these equations for the unknowns A and k is tedious
and time-consuming. It is customary instead to write Eq. (2) in the form

(12.11.5)  $\log Y = \log A + kX$

and to fit a straight line by the method of Chapter 11 to the observed values of
$\log y_\alpha$ and $x_\alpha$. This, of course, means that the sum of squares of deviations for
log Y is minimized instead of the corresponding S.S. for Y. If the standard
deviation for Y is proportional to Y itself, as seems to be nearly true for many
types of data in economics, this procedure is quite reasonable, since
$\delta(\log Y) \approx \delta Y/Y$ and the standard deviation of log Y is therefore approximately
constant. If, however, there is reason to believe that the standard deviation
of Y itself is constant, the effect of the customary procedure is to give undue
weight to the smaller values of Y.

A method of allowing (at least approximately) for this effect is to fit a straight
line to the observed $\log y_\alpha$ but to weight each observation proportionately to $y_\alpha$.
The weighted least-squares condition is

(12.11.6)  $S(\log y_\alpha - a - kx_\alpha)^2\,y_\alpha = \min$

where $a = \log A$ in Eq. (5). This furnishes the normal equations:

(12.11.7)
$a\,S y_\alpha + k\,S(x_\alpha y_\alpha) = S(y_\alpha \log y_\alpha)$
$a\,S(x_\alpha y_\alpha) + k\,S(x_\alpha^2 y_\alpha) = S(x_\alpha y_\alpha \log y_\alpha)$

from which a and k can be found. If common logarithms instead of natural
logarithms are used, the equation for Y will be of the form $Y = A\cdot 10^{kx}$ instead
of that in Eq. (2).

Table 12.7

    x        y      log10 y
    6      0.029    -1.538
    7      0.052    -1.284
    8      0.079    -1.102
    9      0.125    -0.903
   10      0.181    -0.742
   11      0.261    -0.583
   12      0.425    -0.372
   13      0.738    -0.132
   14      1.130     0.053
   15      1.882     0.275
   16      2.812     0.449

Example 5  (Snedecor [11])  In Table 12.7, x represents the age in days of
chick embryos and y the dry weight in grams.

The calculated $y_c$ for a given x, obtained by fitting a straight line to the
weighted values of log y, is $y_c = 0.001875(10)^{0.1989x} = 0.001875e^{0.4581x}$.

Without weighting, the result is $y_c = 0.002046e^{0.4511x}$, while the exact
least-squares solution is $y_c = 0.001895e^{0.4573x}$. The method of weighting the observed
values of log y gives (at least in this example) a very good approximation.
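
The weighted and unweighted fits of Example 5 can be reproduced directly from the normal equations (12.11.7). The following is a minimal sketch, not the author's computing scheme: the helper name `fit_log_line` is an assumption, and common logarithms are used throughout, so the fitted line is $\log_{10} y = a + kx$.

```python
import math

# Table 12.7: age x (days) and dry weight y (grams) of chick embryos.
X = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
Y = [0.029, 0.052, 0.079, 0.125, 0.181, 0.261,
     0.425, 0.738, 1.130, 1.882, 2.812]


def fit_log_line(xs, ys, weights):
    """Weighted least-squares line log10(y) = a + k*x; returns (a, k)."""
    logs = [math.log10(y) for y in ys]
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, xs))
    swxx = sum(w * x * x for w, x in zip(weights, xs))
    swl = sum(w * l for w, l in zip(weights, logs))
    swxl = sum(w * x * l for w, x, l in zip(weights, xs, logs))
    det = sw * swxx - swx ** 2
    a = (swl * swxx - swx * swxl) / det
    k = (sw * swxl - swx * swl) / det
    return a, k


a_w, k_w = fit_log_line(X, Y, Y)                 # weights proportional to y
a_u, k_u = fit_log_line(X, Y, [1.0] * len(X))    # equal weights
for label, a, k in (("weighted", a_w, k_w), ("unweighted", a_u, k_u)):
    print(f"{label}: A = {10 ** a:.6f}, k(base 10) = {k:.4f}, "
          f"k(base e) = {k * math.log(10):.4f}")
```

Passing the y values themselves as weights corresponds to Eq. (12.11.6); passing equal weights gives the customary unweighted fit of Eq. (12.11.5).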

In some problems the data follow more or less a modified exponential curve,
expressed by

(12.11.8)  $Y = C + Ae^{kX}$

The exact least-squares solution of Eq. (8) for a set of sample values $x_\alpha$, $y_\alpha$,
is even more difficult than for Eq. (2), and when plotted on semilog graph paper
the points $(x_\alpha, y_\alpha)$ do not lie nearly on a straight line. However it is possible
to use a graphical method due to Cowden [10] to obtain approximate values of
C, A and k, and these may be improved, if necessary, by using Seidel's process
(§ 12.12).

The data are plotted on ordinary or semilog graph paper and a tentative
trend line is drawn in by hand. Three equidistant ordinates $Y_0$, $Y_1$ and $Y_2$ of
the curve at convenient values of X ($X - h$, $X$ and $X + h$, say) are measured,
and C is estimated from the relation:

(12.11.9)  $C = \dfrac{Y_0Y_2 - Y_1^2}{Y_0 + Y_2 - 2Y_1}$

Values of $y_\alpha - C$ are now plotted on semilog graph paper. If C is correct, these
points should lie close to a straight line; if there still appears to be some curvature
the value of C may be readjusted slightly by trial. From the resulting straight
line, A and k may be estimated, A being the ordinate at X = 0 and $e^{kx_1}$ the
ratio of the ordinates at $X = x_1$ and X = 0.
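
Eq. (12.11.9) follows because $(Y_0 - C)(Y_2 - C) = (Y_1 - C)^2$ for any three equally spaced ordinates of the curve (12.11.8). A short sketch of the estimate is given below; the function name `three_point_c` and the trial curve $Y = 2 + 3e^{0.5X}$ are hypothetical values chosen here so that the answer is known in advance.

```python
import math


def three_point_c(y0, y1, y2):
    """Eq. (12.11.9): estimate of C from three equally spaced ordinates."""
    return (y0 * y2 - y1 ** 2) / (y0 + y2 - 2.0 * y1)


# Hypothetical trend curve Y = 2 + 3 exp(0.5 X), read at X = 0, 1, 2.
C_TRUE, A_TRUE, K_TRUE = 2.0, 3.0, 0.5
y0, y1, y2 = (C_TRUE + A_TRUE * math.exp(K_TRUE * x) for x in (0.0, 1.0, 2.0))

c_hat = three_point_c(y0, y1, y2)
# Once C is removed, log(Y - C) is linear in X.
a_hat = y0 - c_hat                                    # ordinate at X = 0
k_hat = math.log((y2 - c_hat) / (y0 - c_hat)) / 2.0   # ratio of ordinates at X = 2 and X = 0
print(c_hat, a_hat, k_hat)
```

With ordinates read from a hand-drawn trend line the three values will not be exact, which is why the text allows C to be readjusted by trial.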

* 12.12 Seidel's Method of Successive Approximations  Sometimes it is
convenient to obtain approximate values of the regression coefficients from a graph.
With these as a start, Seidel's method permits better values to be obtained by a
least-squares procedure.

Suppose the regression curve is of the form

$Y = f(x, \beta_0, \beta_1)$

where $\beta_0$ and $\beta_1$ are the true parameters. (The method can easily be extended
to more than two parameters.) If the preliminary approximations are $b_0$ and $b_1$,
let $\delta b_0 = \beta_0 - b_0$ and $\delta b_1 = \beta_1 - b_1$. Then if the approximations are
reasonably good we should be able to neglect squares, products and higher powers of
$\delta b_0$ and $\delta b_1$, so that

(12.12.1)  $Y = f(x, b_0 + \delta b_0, b_1 + \delta b_1)$

(12.12.2)  $\phantom{Y} = f(x, b_0, b_1) + \dfrac{\partial f}{\partial b_0}\,\delta b_0 + \dfrac{\partial f}{\partial b_1}\,\delta b_1$

where $\partial f/\partial b_0$, $\partial f/\partial b_1$ mean the partial derivatives of $f(x, \beta_0, \beta_1)$ with respect to
$\beta_0$ and $\beta_1$, evaluated at $\beta_0 = b_0$ and $\beta_1 = b_1$. Then $\delta b_0$ and $\delta b_1$ are selected so
as to minimize the sum of squares of residuals

$S\left[y_\alpha - f(x_\alpha, b_0, b_1) - \dfrac{\partial f}{\partial b_0}\,\delta b_0 - \dfrac{\partial f}{\partial b_1}\,\delta b_1\right]^2$

The normal equations for $\delta b_0$ and $\delta b_1$ may therefore be written

$S\,\dfrac{\partial f}{\partial b_0}\left(y_\alpha - f - \dfrac{\partial f}{\partial b_0}\,\delta b_0 - \dfrac{\partial f}{\partial b_1}\,\delta b_1\right) = 0$

$S\,\dfrac{\partial f}{\partial b_1}\left(y_\alpha - f - \dfrac{\partial f}{\partial b_0}\,\delta b_0 - \dfrac{\partial f}{\partial b_1}\,\delta b_1\right) = 0$

where, in f and its partial derivatives, x is replaced by $x_\alpha$. On solving these
equations and adding $\delta b_0$ to $b_0$ and $\delta b_1$ to $b_1$, the preliminary approximations
are improved. The process can be repeated if necessary and usually converges
quite rapidly.
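
The correction step described above can be sketched as a small routine. Everything named here (`seidel_step`, the optional weights, the choice of 200 repetitions) is an assumption made for illustration. The routine is applied twice to the chick-embryo data of Table 12.7: once to the weighted straight line in $\log_{10} y$ with the starting values $-2.70$ and $0.20$ used in Example 6, and once, repeatedly, to the unweighted exponential curve (12.11.2) itself.

```python
import math


def seidel_step(xs, ys, weights, f, df0, df1, b0, b1):
    """Solve the two linearized normal equations for (db0, db1)."""
    s00 = s01 = s11 = g0 = g1 = 0.0
    for x, y, w in zip(xs, ys, weights):
        d0, d1 = df0(x, b0, b1), df1(x, b0, b1)
        resid = y - f(x, b0, b1)
        s00 += w * d0 * d0
        s01 += w * d0 * d1
        s11 += w * d1 * d1
        g0 += w * d0 * resid
        g1 += w * d1 * resid
    det = s00 * s11 - s01 ** 2
    return (g0 * s11 - s01 * g1) / det, (s00 * g1 - s01 * g0) / det


X = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
Y = [0.029, 0.052, 0.079, 0.125, 0.181, 0.261,
     0.425, 0.738, 1.130, 1.882, 2.812]
LOG10_Y = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
           -0.372, -0.132, 0.053, 0.275, 0.449]   # as tabulated

# (i) Weighted line log10 Y = a + k x, one correction from (-2.70, 0.20).
da, dk = seidel_step(X, LOG10_Y, Y,
                     lambda x, a, k: a + k * x,   # f
                     lambda x, a, k: 1.0,         # df/da
                     lambda x, a, k: x,           # df/dk
                     -2.70, 0.20)
a2, k2 = -2.70 + da, 0.20 + dk

# (ii) Unweighted curve Y = A exp(k x), repeated corrections.
A_fit, k_fit = 10 ** a2, k2 * math.log(10)
for _ in range(200):
    step_a, step_k = seidel_step(
        X, Y, [1.0] * len(X),
        lambda x, a, k: a * math.exp(k * x),
        lambda x, a, k: math.exp(k * x),
        lambda x, a, k: a * x * math.exp(k * x),
        A_fit, k_fit)
    A_fit, k_fit = A_fit + step_a, k_fit + step_k

print(a2, k2, A_fit, k_fit)
```

Because the first model is linear in a and k, a single correction already gives the weighted least-squares line; the second model is not linear in k, so the step has to be repeated until the corrections become negligible.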

Example 6  Suppose that we have plotted on semilog paper the data in
Example 5 above and have drawn by eye an approximately best-fitting straight
line. From the line we estimate $k_1 = 0.20$ and $a_1 = \log A = -2.70$, these
being first approximations to k and a (= log A) respectively. Then

$\log Y = a + kx$

so that $\partial f/\partial a = 1$, $\partial f/\partial k = x$. The weighted normal equations, weighted
according to the values of $y_\alpha$, are

$S[y_\alpha(\log y_\alpha - a_1 - k_1x_\alpha - \delta a_1 - x_\alpha\,\delta k_1)] = 0$

$S[y_\alpha x_\alpha(\log y_\alpha - a_1 - k_1x_\alpha - \delta a_1 - x_\alpha\,\delta k_1)] = 0$

On substituting the values of $x_\alpha$ and $y_\alpha$, these become

$7.714\,\delta a_1 + 110.712\,\delta k_1 = -0.32786$

$110.712\,\delta a_1 + 1619.18\,\delta k_1 = -4.73783$

from which

$\delta a_1 = -0.0271, \qquad \delta k_1 = -0.0011$

Therefore the improved values of a and k are

$a_2 = -2.70 - 0.0271 = -2.7271$

$k_2 = 0.20 - 0.0011 = 0.1989$

The fitted curve is

$\log Y = -2.7271 + 0.1989x$

or

$Y = 0.001875(10)^{0.1989x}$

which agrees with the result quoted in § 12.11.

PROBLEMS

A. (§§12.1-12.3)

1. Write out the normal equations (12.1.5) for the case of two predictors
($X_0 = 1$, $X_1 = X$, $X_2 = Z$) and show that these equations can be put into the form:

$\bar y = b_0 + b_1\bar x + b_2\bar z$
$s_{YX} = b_1s_X^2 + b_2s_{XZ}$
$s_{YZ} = b_1s_{XZ} + b_2s_Z^2$

where $\bar x$, $\bar y$, $\bar z$ are the means, and $s_X^2$, $s_Y^2$, $s_Z^2$, $s_{YX}$, $s_{YZ}$, $s_{XZ}$ are the variances and
covariances, for the variates X, Y and Z.

2. Show that the equation of the regression plane of Y on X and Z may be written
$(y_c - \bar y)(d_1/s_Y) + (x - \bar x)(d_2/s_X) + (z - \bar z)(d_3/s_Z) = 0$ where $d_1$, $d_2$, $d_3$ are the
cofactors of the elements of the first row in the determinant

$d = \begin{vmatrix} 1 & r_{XY} & r_{ZY} \\ r_{XY} & 1 & r_{XZ} \\ r_{ZY} & r_{XZ} & 1 \end{vmatrix}$

Hint: See Problem 1. The quantities $r_{XY}$, $r_{XZ}$, $r_{ZY}$ are the respective coefficients of
correlation.

3. (Hooker) For a certain district in England, records were kept over 20 years of
the following variates:

Y = seed-hay crop (cwt/acre)
X = spring rainfall (inches)
Z = accumulated temperature above 42°F in spring.

From the data the following statistics were calculated:

$\bar x = 4.91$, $\bar y = 28.02$, $\bar z = 594$, $s_X = 1.10$, $s_Y = 4.42$, $s_Z = 85$,
$r_{XY} = 0.80$, $r_{XZ} = -0.56$, $r_{ZY} = -0.40$.

Use the result of Problem 2 to find the regression of hay crop on spring rainfall
and accumulated temperature.

4. At an experimental farm in Alberta, records were kept over 35 years of the
evaporation (Y) from an open tank. It was thought that this might be related to the date
of observation (X) and to the annual rainfall (Z). With the date coded as an integer
from 1 to 35, the following observations were recorded: $Sx_\alpha = 630$, $Sx_\alpha^2 = 14{,}910$,
$Sz_\alpha = 286.90$, $Sz_\alpha^2 = 2563.47$, $Sy_\alpha = 452.53$, $Sy_\alpha^2 = 5980.79$, $Sx_\alpha y_\alpha = 7814.63$,
$Sz_\alpha y_\alpha = 3626.96$, $Sx_\alpha z_\alpha = 5287.95$.

Write out the matrices A and g and find, by Cramer's rule, the predicting equation
for Y in terms of X and Z.

5. (P. O. Johnson) As part of a study dealing with the prediction of freshman
achievement in college, the following scores were noted for each of a random sample of
50 students:

Y = honor-point ratio at end of freshman year,
$X_1$ = score on an English test,
$X_2$ = score on an algebra test,
$X_3$ = percentile ranking at high school graduation, transformed to probits
(this provides in effect a normal variate with mean 5).

From the data obtained,

$Sy_\alpha = 36.19$                  $Sy_\alpha^2 = 40.6393$                  $N = 50$
$Sx_{1\alpha} = 4802$                $Sx_{2\alpha} = 1560$                    $Sx_{3\alpha} = 248.22$
$Sx_{1\alpha}^2 = 487{,}798$         $Sx_{2\alpha}^2 = 66{,}942$              $Sx_{3\alpha}^2 = 1260.7630$
$Sx_{1\alpha}y_\alpha = 3533.58$     $Sx_{2\alpha}y_\alpha = 1213.74$         $Sx_{3\alpha}y_\alpha = 189.1539$
$Sx_{1\alpha}x_{2\alpha} = 157{,}863$   $Sx_{1\alpha}x_{3\alpha} = 23{,}926.22$   $Sx_{2\alpha}x_{3\alpha} = 7804.57$

Write out and solve the normal equations and so obtain the equation for predicting
Y from $X_1$, $X_2$ and $X_3$.

B. (§§12.4-12.6)

1. In Problem A-4, invert the matrix A. Use Eq. (12.5.6) to find the sum of squares
of residuals, and so obtain an estimate of the variance $\sigma^2$ of Y about the true regression
plane.

2. In Problem A-4, obtain 90% confidence intervals for the regression coefficients
$\beta_0$, $\beta_1$ and $\beta_2$. (For 32 degrees of freedom, $t_\alpha = 1.694$.)

3. Write down the matrix A for the data of Problem A-5. Calculate the diagonal
terms in the inverse matrix and hence obtain the variances of the three regression
coefficients $b_1$, $b_2$ and $b_3$ in the predicting equation for Y. Show that only the regression
on $X_3$ is significant.

4. From the data of Problem A-3, calculate approximately the values of the matrix
elements $a_{ij}$, and so obtain an estimate of the variance $\sigma^2$ about the true regression
plane. Estimate also the variance of the predicted value of Y for a new pair of observed
values of X and Z.

C. (§§ 12.7-12.8)

1. Fit a second-degree parabola to the following data:

   x |  1.0   1.5   2.0   2.5   3.0   3.5   4.0
   y |  1.1   1.3   1.6   2.3   2.7   3.4   4.1

2. Construct an analysis of variance table for the data in Problem C-1, and
determine whether there is a significant reduction in the sum of squares about the regression
line when the straight-line regression is replaced by parabolic regression.

3. If a correlation index $r_c$ is defined by $r_c^2 = 1 - (Sv_\alpha^2)/[(N - 1)s_Y^2]$, where $v_\alpha$
is a residual for parabolic regression, find the value of $r_c$ for the data of Problem C-1.
Compare with the Pearson coefficient of correlation for the same data.

4. (Holzinger) In the following table, X represents mean age in years for a group
of men, and Y their mean vital capacity. Use the method of orthogonal polynomials
to find the equation of the best-fitting cubic curve.

     X      Y   |     X      Y   |     X      Y
   19.5    227  |   37.5    223  |   55.5    201
   22.5    230  |   40.5    218  |   58.5    185
   25.5    230  |   43.5    216  |   61.5    200
   28.5    237  |   46.5    210  |   64.5    169
   31.5    227  |   49.5    205  |   67.5    160
   34.5    229  |   52.5    193  |   70.5    163

Note: The following extract from Anderson and Houseman's tables is relevant:

N = 18

   $\xi_1$ |    1     3     5     7     9    11    13    15    17
   $\xi_2$ |  -40   -37   -31   -22   -10     5    23    44    68
   $\xi_3$ |   -8   -23   -35   -42   -42   -33   -13    20    68

$S(\xi_1^2) = 1938$,  $S(\xi_2^2) = 23{,}256$,  $S(\xi_3^2) = 23{,}256$

$\lambda_1 = 2$,  $\lambda_2 = 3/2$,  $\lambda_3 = 1/3$

5. Draw up an analysis of variance table for the regression of Problem C-4. Find
estimates of the standard error for each of the coefficients of the orthogonal polynomials
obtained in this problem.

D. (§§ 12.9-12.10)

1. In the following table X is the amount of irrigation water (inches) applied to a
crop, and Y is the crop yield in bushels per acre. The numbers in the headings are the
class-marks (central values) of the respective classes. Test the regression of Y on X for
linearity.

Class-marks of X (column headings): 12, 15, 18, 21, 24, 27, 30

    y    cell frequencies (in order, left to right)    row total
   90    1, 2                                               3
   85    2, 3                                               5
   80    2, 5, 4, 1                                        12
   75    2, 4, 6, 1                                        13
   70    4, 3, 1                                            8
   65    2, 3                                               5
   60    2, 2                                               4

Column totals (X = 12 to 30): 2, 4, 10, 19, 8, 5, 2; grand total 50

2. In Problem B-4 of Chapter 11, calculate the two correlation ratios $E_{yx}$ and $E_{xy}$.
Does either regression (of height on weight or of weight on height) depart significantly
from linearity?

3. Prove the statement in § 12.9 that if a straight line is fitted by least squares to
the weighted column means in a grouped two-way table, the result is the ordinary
regression line of Y on X.

4. Show that when $\lambda = 0$, the variate $E^2$ in Eq. (12.10.1) becomes a beta-variate.
Show also that in this case $n_2E^2/[n_1(1 - E^2)]$ has the F distribution with $n_1$ and $n_2$
degrees of freedom.

E. (§§12.11-12.12)

1. The uniform horizontal scale on a sheet of semilog paper ranges from 0 to 10.
The vertical logarithmic scale (on the left side) ranges from 100 to 1000. A straight
line is drawn from the upper end of the vertical scale to the midpoint of the horizontal
scale. What is the equation of the line (a) in the coordinates x and y', where $y' = \log_{10} y$,
(b) in the coordinates x and y?

2. A straight line is drawn on semilog paper through the points (2, 1) and (4, 100).
What is the equation of this line?

3. Fit an exponential curve to the following data (a) without weighting, and (b)
with weighting:

   x |    1      2      3      4       5
   y |   2.1    4.3   14.5   42.2   123.1

Hint: For part (b), use Eq. (12.11.7).

4. Prove that the exact least-squares solution of the problem of fitting the modified
exponential curve $y = C + Ae^{kx}$ requires the determination of A, C and k from the
equations:

$Sy_\alpha = NC + A\,Se^{kx_\alpha}$
$S(y_\alpha e^{kx_\alpha}) = C\,Se^{kx_\alpha} + A\,Se^{2kx_\alpha}$
$S(y_\alpha x_\alpha e^{kx_\alpha}) = C\,S(x_\alpha e^{kx_\alpha}) + A\,S(x_\alpha e^{2kx_\alpha})$

5. Use the method of least squares to fit the curve $y = ax^2 + b/x$ to the following
data:

   x |     1       2       3       4
   y |  -1.51    0.99    3.88    7.66

6. The logistic curve $y_c = a(1 + bq^x)^{-1}$ has been used to represent population
growth. Fit such a curve, by Cowden's method, to the following data on the population
(in millions) of the United States, 1790 to 1950:

   x |  1790    1800    1810    1820    1830    1840    1850    1860
   y |  3.93    5.31    7.24    9.64   12.87   17.07   23.19   31.44

   x |  1870    1880    1890    1900    1910    1920    1930    1940    1950
   y | 39.82   50.16   62.95   76.00   91.97  105.71  122.78  131.67  150.70

Hint: Write the equation $1/y_c = A + Bq^x$, where $A = 1/a$, $B = b/a$, $q = e^p$, and plot
values of 1/y instead of y. Use coded x values.

7. Show that the Gompertz curve, $y_c = ab^{q^x}$, may be approximately fitted by
Cowden's method if log y is plotted instead of y. This curve is used in actuarial work.

8. The following data were obtained in a physical experiment, where E represents
the energy radiated from a carbon filament lamp per cm² per sec, and T the absolute
temperature of the filament in thousands of degrees K.

   T |  1.309   1.471   1.490   1.565   1.611   1.680
   E |  2.138   3.421   3.597   4.340   4.882   5.660

By plotting on log-log graph paper it is seen that the data follow apparently a law of
the type $E = aT^b$, with a = 0.725 and b = 4.0 approximately. Use the Seidel method
to improve these values of a and b.

Chapter 13

SOME REMARKS ON MULTIVARIATE
PROBLEMS AND STOCHASTIC PROCESSES

13.1 Multiple Regression in Terms of Correlation  In Chapter 12 we
considered the multiple regression of one variate on a number of others, and the
present treatment is closely connected with that in §§ 12.1 to 12.6.

For simplicity we will first assume that the predicted variate Y depends on
just two predictors $X_1$ and $X_2$. (The generalization to any larger number is
easily made.) The linear predicting equation for Y is of the form

(13.1.1)  $y_c = b_0 + b_1x_1 + b_2x_2$

Note that we are not now using the dummy variate (which is always equal to 1)
as in § 12.1. The new notation will be more convenient in the present context.
Geometrically, Eq. (1) represents a plane in the three-dimensional sample space
with coordinates $x_1$, $x_2$ and $y$. This is called the regression plane of Y on $X_1$ and $X_2$.

Suppose that N sets of observations $(x_{1\alpha}, x_{2\alpha}, y_\alpha)$ are made on the three
variates. For the sake of uniformity we will let the variate Y be called $X_0$,
with values denoted by $x_{0\alpha}$. The normal equations corresponding to Eq. (12.1.5)
will now be

(13.1.2)
$Nb_0 + b_1Sx_{1\alpha} + b_2Sx_{2\alpha} = Sx_{0\alpha}$
$b_0Sx_{1\alpha} + b_1S(x_{1\alpha}^2) + b_2S(x_{1\alpha}x_{2\alpha}) = S(x_{0\alpha}x_{1\alpha})$
$b_0Sx_{2\alpha} + b_1S(x_{1\alpha}x_{2\alpha}) + b_2S(x_{2\alpha}^2) = S(x_{0\alpha}x_{2\alpha})$

To simplify these equations we can suppose that the origin is chosen at the
arithmetic mean of each of the variates $X_0$, $X_1$ and $X_2$. Then $Sx_{1\alpha} = Sx_{2\alpha} =
Sx_{0\alpha} = 0$. Also, if $s_1^2$, $s_2^2$ and $s_0^2$ denote the sample variances of $X_1$, $X_2$ and $X_0$
respectively, and if $r_{ij}$ is the sample Pearson coefficient of correlation between
$X_i$ and $X_j$ (i, j = 0, 1, 2), we have

(13.1.3)
$S(x_{1\alpha}^2) = (N - 1)s_1^2, \qquad S(x_{2\alpha}^2) = (N - 1)s_2^2$
$S(x_{1\alpha}x_{2\alpha}) = (N - 1)r_{12}s_1s_2$, etc.

so that the equations of (2) become

(13.1.4)
$b_0 = 0$
$b_1s_1 + b_2r_{12}s_2 = r_{01}s_0$
$b_1r_{12}s_1 + b_2s_2 = r_{02}s_0$

From these we get

(13.1.5)  $b_1 = \dfrac{s_0(r_{01} - r_{02}r_{12})}{s_1(1 - r_{12}^2)}, \qquad b_2 = \dfrac{s_0(r_{02} - r_{01}r_{12})}{s_2(1 - r_{12}^2)}$

If we let $R_{ij}$ be the cofactor of $r_{ij}$ in the determinant of the correlation matrix

(13.1.6)  $R = \begin{pmatrix} r_{00} & r_{01} & r_{02} \\ r_{01} & r_{11} & r_{12} \\ r_{02} & r_{12} & r_{22} \end{pmatrix}$

where, of course, $r_{ij} = 1$ when i = j, we may express $b_1$ and $b_2$ in the equivalent
forms:

(13.1.7)  $b_1 = -\dfrac{s_0R_{01}}{s_1R_{00}}, \qquad b_2 = -\dfrac{s_0R_{02}}{s_2R_{00}}$

The equation of the regression plane may therefore be written in the symmetrical
notation:

(13.1.8)  $\dfrac{x_0R_{00}}{s_0} + \dfrac{x_1R_{01}}{s_1} + \dfrac{x_2R_{02}}{s_2} = 0$

If we have p variates on which $X_0$ may depend, the equation of the regression
hyperplane of $X_0$ on $X_1, X_2, \ldots, X_p$ is

(13.1.9)  $\sum_{i=0}^{p} \dfrac{x_iR_{0i}}{s_i} = 0$

where $R_{0i}$ is the cofactor of $r_{0i}$ in the determinant of R, the matrix with typical
element $r_{ij}$. The predicted $x_0$ can be written, when $R_{00} \neq 0$,

(13.1.10)  $x_{0c} = -\dfrac{s_0}{R_{00}}\sum_{i=1}^{p} \dfrac{R_{0i}x_i}{s_i}$

so that the relative contribution of the variate $X_i$ to the prediction of $X_0$ is
measured by the coefficient $(R_{0i}s_0)/(R_{00}s_i)$. Equation (10) is, of course, precisely
equivalent to Eqs. (12.1.1) and (12.3.2) but in a different notation. The purpose
of giving this alternative form is to show the relation of the regression coefficients
to the coefficients of correlation between the different variates.
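
The equivalence of the two forms of the regression coefficients, Eqs. (13.1.5) and (13.1.7), can be checked numerically. In the sketch below the standard deviations and correlations are hypothetical values chosen for illustration, and the helper name `cofactor` is an assumption.

```python
def cofactor(m, i, j):
    """Signed cofactor of element (i, j) of a 3 x 3 matrix."""
    rows = [r for r in range(3) if r != i]
    cols = [c for c in range(3) if c != j]
    minor = (m[rows[0]][cols[0]] * m[rows[1]][cols[1]]
             - m[rows[0]][cols[1]] * m[rows[1]][cols[0]])
    return (-1) ** (i + j) * minor


# Hypothetical sample statistics (not taken from the text).
s0, s1, s2 = 4.0, 2.0, 5.0
r01, r02, r12 = 0.6, 0.5, 0.3
R = [[1.0, r01, r02],
     [r01, 1.0, r12],
     [r02, r12, 1.0]]

# Eq. (13.1.5): coefficients written directly in terms of correlations.
b1 = s0 * (r01 - r02 * r12) / (s1 * (1 - r12 ** 2))
b2 = s0 * (r02 - r01 * r12) / (s2 * (1 - r12 ** 2))

# Eq. (13.1.7): the same coefficients from cofactors of R.
b1_cof = -s0 * cofactor(R, 0, 1) / (s1 * cofactor(R, 0, 0))
b2_cof = -s0 * cofactor(R, 0, 2) / (s2 * cofactor(R, 0, 0))

print(b1, b1_cof, b2, b2_cof)
```

Both pairs also satisfy the reduced normal equations (13.1.4), which gives a second check on the signs of the cofactors.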


13.2 Multiple Correlation  If $v_\alpha$ is the difference between the observed $x_{0\alpha}$
and the computed value given by Eq. (13.1.10) for the observed $x_{i\alpha}$
(i = 1, 2, ..., p),

(13.2.1)  $Sv_\alpha^2 = S\left(x_{0\alpha} + \sum_{i=1}^{p} \dfrac{s_0R_{0i}x_{i\alpha}}{R_{00}s_i}\right)^2$

This is the sum of squares of residuals. On using Eq. (13.1.3) it becomes

$Sv_\alpha^2 = \dfrac{(N - 1)s_0^2}{R_{00}^2}\left(R_{00}^2 + 2R_{00}\sum_{i=1}^{p} R_{0i}r_{0i} + \sum_{i=1}^{p}\sum_{j=1}^{p} R_{0i}R_{0j}r_{ij}\right)$

But we know from the properties of cofactors that $\sum_j R_{0j}r_{ij} = 0$ if $i \neq 0$
and $= d(R)$ if i = 0, where d(R) is the determinant of R, so that

(13.2.2)  $Sv_\alpha^2 = (N - 1)s_0^2\,\dfrac{d(R)}{R_{00}}$

The quantity $S(v_\alpha^2)/(N - 1)$ is the approximate variance of estimate of $x_0$,
denoted by $s_{0.12\ldots p}^2$. Therefore

(13.2.3)  $s_{0.12\ldots p}^2 = s_0^2\,\dfrac{d(R)}{R_{00}}$

As in § 12.6, it may be proved that $(N - 1)s_{0.12\ldots p}^2/(N - p - 1)$ is an
unbiased estimate of the corresponding population parameter $\sigma_{0.12\ldots p}^2$, usually
denoted simply by $\sigma^2$.

The variance due to regression may be defined by

(13.2.4)  $s_{012\ldots p}^2 = s_0^2 - s_{0.12\ldots p}^2 = s_0^2\left(1 - \dfrac{d(R)}{R_{00}}\right)$

The ratio of the variance due to regression to the total variance of $X_0$
(namely, $s_0^2$) is the square of the multiple correlation coefficient

(13.2.5)  $r_{0.12\ldots p}^2 = 1 - \dfrac{d(R)}{R_{00}}$

For the case p = 1, $d(R) = 1 - r_{01}^2$ and $R_{00} = 1$, so that $r_{0.12\ldots p}$ reduces to
the ordinary correlation coefficient between $X_0$ and $X_1$.

Example 1  The variates $X_0$, $X_1$ and $X_2$ have pairwise correlation coefficients
$r_{01} = 0.8$, $r_{02} = -0.7$, $r_{12} = -0.9$. The matrix R is

$R = \begin{pmatrix} 1.0 & 0.8 & -0.7 \\ 0.8 & 1.0 & -0.9 \\ -0.7 & -0.9 & 1.0 \end{pmatrix}$

so that d(R) = 0.068, $R_{00} = 0.19$. Therefore $r_{0.12}^2 = 1 - 0.36 = 0.64$, so that
$r_{0.12} = 0.80$.

Example 2  If $r_{01} = 0.6$ and $r_{02} = 0.4$, find $r_{12}$ so that $r_{0.12} = 1$.

If we write $r_{12} = r$, and substitute the known values in R, we find that
the condition d(R) = 0 becomes $r^2 - 0.48r - 0.48 = 0$. The positive root of this
quadratic equation is r = 0.97 (the other root, r = -0.49, also makes d(R) vanish).
In this example there is perfect multiple correlation of $X_0$ with $X_1$
and $X_2$, in the sense that all the observed points lie in one regression plane,
even though the individual correlations of $X_0$ with $X_1$ and $X_2$ separately are
not large.
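
The two examples can be verified with a direct evaluation of Eq. (13.2.5). The following is a small sketch for the three-variate case only; the function names `det3` and `multiple_r2` are assumptions introduced here.

```python
import math


def det3(m):
    """Determinant of a 3 x 3 matrix by expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def multiple_r2(r01, r02, r12):
    """Squared multiple correlation of X0 on X1 and X2, Eq. (13.2.5)."""
    big_r = [[1.0, r01, r02], [r01, 1.0, r12], [r02, r12, 1.0]]
    r00 = 1.0 - r12 ** 2                # cofactor of the element r00
    return 1.0 - det3(big_r) / r00


# Example 1
R1 = [[1.0, 0.8, -0.7], [0.8, 1.0, -0.9], [-0.7, -0.9, 1.0]]
d1 = det3(R1)
r2_ex1 = multiple_r2(0.8, -0.7, -0.9)

# Example 2: d(R) = 0 reduces to r^2 - 0.48 r - 0.48 = 0.
disc = math.sqrt(0.48 ** 2 + 4 * 0.48)
roots = [(0.48 + disc) / 2, (0.48 - disc) / 2]

print(d1, math.sqrt(r2_ex1), roots)
```

Substituting either root of the quadratic back into `multiple_r2(0.6, 0.4, root)` returns a value of 1 to within rounding error.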

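The variance of some future observed value $x_0$ corresponding to assigned
values $x_1, x_2, \ldots, x_p$ of the predictors (this set not being any of the N sets of
values already used in computing the correlations) is given by Eq. (12.4.8). In
the notation of the present chapter, and with the origin placed at the sample
mean, the matrix A of § 12.1 is

(13.2.6)  $A = (N - 1)\begin{pmatrix} \dfrac{N}{N - 1} & 0 & 0 & \cdots & 0 \\ 0 & s_1^2 & r_{12}s_1s_2 & \cdots & r_{1p}s_1s_p \\ 0 & r_{12}s_1s_2 & s_2^2 & \cdots & r_{2p}s_2s_p \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & r_{1p}s_1s_p & r_{2p}s_2s_p & \cdots & s_p^2 \end{pmatrix}$

For the special case p = 2, $d(A) = N(N - 1)^2s_1^2s_2^2(1 - r_{12}^2)$ and

(13.2.7)  $A^{-1} = (N - 1)^{-1}\begin{pmatrix} \dfrac{N - 1}{N} & 0 & 0 \\ 0 & s_1^{-2}(1 - r_{12}^2)^{-1} & -s_1^{-1}s_2^{-1}r_{12}(1 - r_{12}^2)^{-1} \\ 0 & -s_1^{-1}s_2^{-1}r_{12}(1 - r_{12}^2)^{-1} & s_2^{-2}(1 - r_{12}^2)^{-1} \end{pmatrix}$

so that

(13.2.8)  $V(x_0) = \sigma^2\left[1 + \dfrac{1}{N} + \dfrac{x_1^2/s_1^2 + x_2^2/s_2^2 - 2r_{12}x_1x_2/(s_1s_2)}{(N - 1)(1 - r_{12}^2)}\right]$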

The multiple correlation coefficient may be regarded as the ordinary
correlation coefficient between the observed and the computed values of $X_0$, the
latter being given by Eq. (13.1.10). By using Eq. (4) for the variance of the N
computed values, we obtain for this correlation coefficient:

(13.2.9)  $r = \left[-\dfrac{s_0}{R_{00}}\sum_{i=1}^{p} \dfrac{R_{0i}}{s_i}\,S(x_{i\alpha}x_{0\alpha})\right]\Big/\left[(N - 1)\,s_0\,s_{012\ldots p}\right]$

Since $Sx_{i\alpha}x_{0\alpha} = (N - 1)r_{0i}s_0s_i$ and $\sum_{i=1}^{p} R_{0i}r_{0i} = d(R) - R_{00}$, this reduces to

(13.2.10)  $r = \dfrac{s_0}{s_{012\ldots p}}\left[1 - \dfrac{d(R)}{R_{00}}\right]$

which is the same as $r_{0.12\ldots p}$.

* 13.3 The Distribution of the Multiple Correlation Coefficient  From Eqs.
(13.2.2) and (13.2.5),

H3  3n  S(va2)_d(R)_ 

(13.3.1)  2   -  —  =  1  -  r0fl2  . .  p 

whence 

r<U2»,  S(x0a2)-S(i;a2)      S(x0ca2) 


(13.3.2) 


l-r0212..p  S(»a2)  S(t;a2) 


The numerator is the sum of squares of the calculated values of the variate $X_0$
(the sum of squares due to regression), while the denominator is the sum of
squares about the regression plane. If the variates $X_1, X_2 \ldots X_p$ all have fixed
sets of values and if $X_0$ is a random normal variate independent of $X_1, X_2 \ldots X_p$
(which means that the true multiple correlation coefficient is zero), the numerator
and denominator are independently distributed as $\chi^2\sigma^2$ with $p$ and $N - p - 1$ d.f.
respectively. Here $\sigma^2$ stands for $\sigma_{0\cdot 12\ldots p}^2$, the population variance about the true
regression plane. It follows that

(13.3.3)  $$F = \frac{(N-p-1)\,S(x_{0c\alpha}^2)}{p\,S(v_\alpha^2)}$$

has the F distribution with $p$ and $N - p - 1$ d.f. This means that $r_{0\cdot 12\ldots p}^2$ is a
beta-variate with parameters $p/2$ and $(N-p-1)/2$ and so its distribution is
identical with that of the squared correlation ratio $E_{yx}^2$ (see § 12.10) with $p + 1$
instead of $p$. The slight change arises from the fact that we are now dealing
with $p + 1$ variates altogether, namely, $X_0$ (or $Y$) and $X_1, X_2 \ldots X_p$.
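
Combining Eqs. (13.3.2) and (13.3.3) gives $F = (N-p-1)r^2/[p(1-r^2)]$, so the test of a zero multiple correlation can be computed from $r^2$ alone. The following is a minimal sketch of that conversion and its inverse; the function names are illustrative and not part of the text.

```python
def f_from_r2(r2: float, n: int, p: int) -> float:
    """F of Eq. (13.3.3), rewritten in terms of the squared multiple correlation.

    r2 : squared sample multiple correlation, 0 <= r2 < 1
    n  : sample size N
    p  : number of predictor variates X1 ... Xp
    """
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    if n - p - 1 <= 0:
        raise ValueError("need N > p + 1")
    # Ratio (13.3.2) scaled by the degrees of freedom (N - p - 1) and p.
    return (n - p - 1) * r2 / (p * (1.0 - r2))


def r2_from_f(f: float, n: int, p: int) -> float:
    """Inverse relation: recover r2 from an F value with p and N - p - 1 d.f."""
    return p * f / (p * f + (n - p - 1))
```

For instance, with $N = 23$, $p = 2$ and $r^2 = 0.5$ the ratio $r^2/(1-r^2)$ is 1 and $F = 20/2 = 10$, to be referred to the F distribution with 2 and 20 d.f.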

The distribution of the multiple correlation coefficient, when the corresponding
coefficient $\rho_{0\cdot 12\ldots p}$ in the population is not zero, was worked out by
Fisher [1]. The density function is

(13.3.4)  $$f(r^2) = (1-\rho^2)^{(N-1)/2}(1-r^2)^{(N-p-3)/2}(r^2)^{(p-2)/2}\,g(r^2)$$

where $r^2$ and $\rho^2$ are written for the squares of the multiple correlation coefficient
in the sample and in the population, and where

$$g(r^2) = \frac{F[(N-1)/2,\ (N-1)/2,\ p/2;\ \rho^2r^2]}{B[p/2,\ (N-p-1)/2]}$$

the numerator being a hypergeometric function and the denominator a beta
function.

The definition of the hypergeometric function as a series is

(13.3.5)  $$F(a, b, c; z) = 1 + \frac{ab}{c}\,\frac{z}{1!} + \frac{a(a+1)b(b+1)}{c(c+1)}\,\frac{z^2}{2!} + \ldots = \sum_{n=0}^{\infty}\frac{[a]_n[b]_n}{[c]_n}\,\frac{z^n}{n!}$$

where $[a]_n = a(a+1)(a+2) \ldots (a+n-1)$. For the properties of this
function see references [2] and [3].
The expected value of $r^2$ is given by

(13.3.6)  $$E(r^2) = 1 - \frac{N-p-1}{N-1}\,(1-\rho^2)\,F\!\left(1, 1, \frac{N+1}{2};\ \rho^2\right)$$

When $\rho = 0$ this reduces to

(13.3.7)  $$E(r^2) = \frac{p}{N-1}$$
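
The series (13.3.5) and the expectation (13.3.6) are easy to evaluate numerically. The sketch below sums the series term by term, which is adequate for $|z| < 1$, and uses it for $E(r^2)$, read here as $1 - \frac{N-p-1}{N-1}(1-\rho^2)F(1, 1, (N+1)/2; \rho^2)$. The names `hyp2f1` and `expected_r2`, the tolerance and the term limit are assumptions made for illustration.

```python
def hyp2f1(a: float, b: float, c: float, z: float,
           tol: float = 1e-15, max_terms: int = 10_000) -> float:
    """Hypergeometric series F(a, b, c; z) of Eq. (13.3.5), summed directly.

    Plain truncation of the power series; intended only for |z| < 1.
    """
    if abs(z) >= 1.0:
        raise ValueError("direct summation needs |z| < 1")
    term = 1.0   # the n = 0 term
    total = 1.0
    for n in range(max_terms):
        # Ratio of successive terms: (a+n)(b+n) / (c+n) * z / (n+1).
        term *= (a + n) * (b + n) / (c + n) * z / (n + 1)
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def expected_r2(rho2: float, n: int, p: int) -> float:
    """E(r^2) of Eq. (13.3.6); rho2 is the squared population coefficient."""
    return 1.0 - (n - p - 1) / (n - 1) * (1.0 - rho2) * hyp2f1(
        1.0, 1.0, (n + 1) / 2.0, rho2)
```

With `rho2 = 0` the series equals 1 and the function returns $p/(N-1)$, in agreement with Eq. (13.3.7); this shows that $r^2$ overestimates a zero population value on the average.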


It  may  be  noted  that  Fisher's  z'-transformation  (§  11.13)  can  be  applied  to 
the  multiple  correlation  coefficient  and  brings  about  approximate  normality,  for 
moderately  large  N.   The  variance  of  z'  is  p/(N  —  p  —  2),  approximately. 

13.4 Partial Correlation  Sometimes we would like to know what the
correlation would be between say $X_0$ (or $Y$) and $X_1$ if the influence of all other
variates such as $X_2, X_3 \ldots X_p$ were eliminated. This is called the partial
correlation of $X_0$ and $X_1$, and the coefficient is written $r_{01\cdot 2\ldots p}$. It is in general
different from the ordinary correlation coefficient $r_{01}$ for $X_0$ and $X_1$. For
simplicity we consider the case of three variates, $X_0$, $X_1$ and $X_2$, for which the
three pairwise Pearson coefficients of correlation are $r_{01}$, $r_{02}$, $r_{12}$, and we
suppose that $X_2$ is the variate to be eliminated. The effect of $X_2$ on $X_0$ is
estimated from the ordinary regression of $X_0$ on $X_2$, ignoring $X_1$, and is given
by $r_{02}s_0x_2/s_2$ for a measured value $x_2$. (Each variate is supposed measured
from its own mean as origin.) The residual part of the observed $x_0$, after
subtracting the part due to $X_2$, is

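(13.4.1)  $$x_{0\cdot 2} = x_0 - r_{02}s_0\,\frac{x_2}{s_2}$$

In the same way, the residual part of the observed $x_1$ is

(13.4.2)  $$x_{1\cdot 2} = x_1 - r_{12}s_1\,\frac{x_2}{s_2}$$

The partial correlation coefficient $r_{01\cdot 2}$ is defined as the ordinary correlation
coefficient for $x_{0\cdot 2}$ and $x_{1\cdot 2}$. That is

(13.4.3)  $$r_{01\cdot 2} = \frac{S(x_{0\cdot 2})_\alpha(x_{1\cdot 2})_\alpha}{(N-1)\,s_{0\cdot 2}\,s_{1\cdot 2}}$$

the numerator being summed over all the $N$ sets of observations. In the
denominator, $s_{0\cdot 2}^2$ is the residual variance of $x_0$ after eliminating the regression
on $x_2$, so that

(13.4.4)  $$s_{0\cdot 2}^2 = s_0^2(1 - r_{02}^2)$$

Similarly,

(13.4.5)  $$s_{1\cdot 2}^2 = s_1^2(1 - r_{12}^2)$$

Using Eqs. (1) and (2), we find

(13.4.6)  $$\begin{aligned} S(x_{0\cdot 2})_\alpha(x_{1\cdot 2})_\alpha &= Sx_{0\alpha}x_{1\alpha} - r_{02}\frac{s_0}{s_2}\,Sx_{1\alpha}x_{2\alpha} - r_{12}\frac{s_1}{s_2}\,Sx_{0\alpha}x_{2\alpha} + r_{02}r_{12}\frac{s_0s_1}{s_2^2}\,Sx_{2\alpha}^2 \\ &= (N-1)\left(r_{01}s_0s_1 - r_{02}\frac{s_0}{s_2}\,r_{12}s_1s_2 - r_{12}\frac{s_1}{s_2}\,r_{02}s_0s_2 + r_{02}r_{12}s_0s_1\right) \\ &= (N-1)\,s_0s_1\,(r_{01} - r_{02}r_{12}) \end{aligned}$$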


Therefore,

(13.4.7)  $$r_{01\cdot 2} = \frac{r_{01} - r_{02}r_{12}}{[(1-r_{02}^2)(1-r_{12}^2)]^{1/2}}$$

This expresses the partial correlation coefficient in terms of the three ordinary
correlation coefficients. In terms of the correlation matrix $R$, as defined in
Eq. (13.1.6),

(13.4.8)  $$r_{01\cdot 2} = \frac{-R_{01}}{(R_{00}R_{11})^{1/2}}$$

and this form can be generalized for $p$ variates $X_1, X_2 \ldots X_p$. Thus

(13.4.9)  $$r_{01\cdot 2\ldots p} = \frac{-R_{01}}{(R_{00}R_{11})^{1/2}}$$

where the correlation matrix now has $p + 1$ rows and columns.

In certain circumstances the partial correlation coefficient $r_{01\cdot 2}$ is the same
as the ordinary coefficient $r_{01}$ when the third variate $X_2$ is held constant. That
is, if we select out of the set of all observations a subset in which $X_2 = x_2$ (very
nearly), and calculate $r_{01}$ for this subset, we get in effect $r_{01\cdot 2}$. In general this
is not true, and the result depends on the chosen value $x_2$; the calculated $r_{01}$
is equal to $r_{01\cdot 2}$ if and only if (a) the bivariate regression of $X_0$ on $X_2$ (ignoring
$X_1$) is linear, with the variance of $X_0$ constant for all $X_2$; (b) the trivariate
regression of $X_0$ on $X_1$ and $X_2$ is linear with the variance of $X_0$ constant for
all $X_1$ and $X_2$. These conditions are not likely to be satisfied very precisely in
practical applications, and the calculated value of $r_{01\cdot 2}$ will be a sort of average
of the correlations $r_{01}$ that would be obtained for different assigned values
of $x_2$ (see Problem A-6).

Example 3  The following results were obtained at Syracuse University in
an investigation by M. A. May of the factors influencing "academic success."
The sample consisted of 450 students and the variates were $X_0$ (honor points),
$X_1$ (general intelligence) and $X_2$ (hours of study per week). One object of the
investigation was to find to what extent honor points were related to general
intelligence, when the effect of varying study periods was eliminated. From
the data,

$\bar{X}_0 = 18.5$,    $\bar{X}_1 = 100.6$,    $\bar{X}_2 = 24$

$s_0 = 11.2$,    $s_1 = 15.8$,    $s_2 = 6.0$

$r_{01} = 0.60$,    $r_{02} = 0.32$,    $r_{12} = -0.35$

We find from Eq. (7) that $r_{01\cdot 2} = 0.80$. The multiple correlation of $X_0$ on $X_1$
and $X_2$, as given by Eq. (13.2.5), is $r_{0\cdot 12} = 0.82$. The regression coefficients,
from Eq. (13.1.7), are $b_1 = 0.58$, $b_2 = 1.13$.
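
The figures quoted in Example 3 can be checked with a few lines of code. The sketch below uses Eq. (13.4.7) for the partial correlation and the relation $r^2 = 1 - d(R)/R_{00}$ for the multiple correlation. The regression coefficients are computed from the usual two-predictor formulas, which are assumed here to agree with Eq. (13.1.7); the function names are illustrative.

```python
from math import sqrt


def partial_r(r01: float, r02: float, r12: float) -> float:
    """Partial correlation of X0 and X1 with X2 eliminated, Eq. (13.4.7)."""
    return (r01 - r02 * r12) / sqrt((1 - r02 ** 2) * (1 - r12 ** 2))


def multiple_r(r01: float, r02: float, r12: float) -> float:
    """Multiple correlation of X0 on X1 and X2 from r^2 = 1 - d(R)/R00."""
    det_r = 1 + 2 * r01 * r02 * r12 - r01 ** 2 - r02 ** 2 - r12 ** 2
    r00 = 1 - r12 ** 2   # cofactor of the leading element of R
    return sqrt(1 - det_r / r00)


def regression_coefficients(r01, r02, r12, s0, s1, s2):
    """Coefficients b1, b2 of X0 on X1 and X2.

    Standard two-predictor formulas, assumed equivalent to Eq. (13.1.7).
    """
    b1 = (s0 / s1) * (r01 - r02 * r12) / (1 - r12 ** 2)
    b2 = (s0 / s2) * (r02 - r01 * r12) / (1 - r12 ** 2)
    return b1, b2


# Figures of Example 3.
r01, r02, r12 = 0.60, 0.32, -0.35
s0, s1, s2 = 11.2, 15.8, 6.0
```

Because $r_{12}$ is negative, the product $r_{02}r_{12}$ is subtracted as a negative quantity, which is why the partial coefficient comes out larger than $r_{01}$.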

The sampling distribution of the partial correlation coefficient $r_{01\cdot 2}$ is the
same as that of $r_{01}$ (see § 11.12) but with $N - 1$ instead of $N$ and with $\rho_{01\cdot 2}$
instead of $\rho$. With $p + 1$ variates, $X_0, X_1 \ldots X_p$, the density function for
$r_{01\cdot 2\ldots p}$ has $N - p + 1$ instead of $N$.

13.5 The Multivariate Normal Distribution  The univariate normal distribution
has the density function (when the variable is standardized so as to have
mean zero and variance unity)

(13.5.1)  $$f(x) = (2\pi)^{-1/2}\exp(-x^2/2)$$

The importance and central position of this distribution in statistical theory
have already been emphasized in earlier chapters. A corresponding position in
multivariate theory is taken by the standardized multivariate normal
distribution, with joint density function

(13.5.2)  $$f(x_0, x_1 \ldots x_p) = (2\pi)^{-(p+1)/2}\,|d(A)|^{1/2}\exp(-Q/2)$$

where

(13.5.3)  $$Q = x'Ax$$

Here $x'$ is the row vector $(x_0\ x_1 \ldots x_p)$ and $x$ is its transpose (a column
vector), while $A$ is a symmetric positive definite matrix of $p + 1$ rows and
columns. It is, in fact, the inverse of the correlation matrix $P$.

(13.5.4)  $$P = \begin{bmatrix} 1 & \rho_{01} & \cdots & \rho_{0p} \\ \rho_{01} & 1 & \cdots & \rho_{1p} \\ \vdots & \vdots & & \vdots \\ \rho_{0p} & \rho_{1p} & \cdots & 1 \end{bmatrix}$$

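For the bivariate case, in agreement with the customary notation, we will
write $x_1 = x$, $x_0 = y$, $\rho_{01} = \rho$. Then

$$P = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}, \qquad A = (1-\rho^2)^{-1}\begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}, \qquad d(A) = (1-\rho^2)^{-1}$$

$$(1-\rho^2)\,Q = \begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = x^2 + y^2 - 2\rho xy$$

so that

(13.5.5)  $$f(x, y) = (2\pi)^{-1}(1-\rho^2)^{-1/2}\exp\left[-\frac{x^2 + y^2 - 2\rho xy}{2(1-\rho^2)}\right]$$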

This  is  the  bivariate  normal  distribution  in  standardized  form.  Tables  of 
this  function  for  selected  values  of  p  may  be  found  in  references  [4]  and  [5].  A 
method  has  been  given  [6]  for  reducing  the  integral  of  the  multivariate  function 
to  a  bivariate  integral  and  thereby  obtaining  numerical  values. 
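
As a small numerical illustration, the standardized bivariate density, $f(x, y) = (2\pi)^{-1}(1-\rho^2)^{-1/2}\exp\{-(x^2+y^2-2\rho xy)/[2(1-\rho^2)]\}$, can be coded directly. The sketch below is only that, a direct transcription with illustrative function names; when $\rho = 0$ the result factorizes into the product of two univariate densities of the form (13.5.1).

```python
from math import exp, pi, sqrt


def univariate_normal_density(x: float) -> float:
    """Standardized univariate normal density, Eq. (13.5.1)."""
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)


def bivariate_normal_density(x: float, y: float, rho: float) -> float:
    """Standardized bivariate normal density, Eq. (13.5.5)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    one_minus_rho2 = 1.0 - rho * rho
    # Q = x'Ax with A the inverse of the 2 x 2 correlation matrix.
    q = (x * x + y * y - 2.0 * rho * x * y) / one_minus_rho2
    return exp(-q / 2.0) / (2.0 * pi * sqrt(one_minus_rho2))
```

At the origin the density is $1/[2\pi(1-\rho^2)^{1/2}]$, so the peak rises as the correlation becomes stronger.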

* 13.6 The Relation of the Multivariate and Multinomial Distributions  We saw
in Chapter 3 that with increasing sample size the binomial distribution with
parameter $\theta$ tends to a normal distribution with mean $N\theta$ and variance
$N\theta(1-\theta)$. A similar result holds for the multinomial distribution (Appendix
A.16 and 17). If the probability that a random item from the population belongs
to the $i$th class is $\pi_i$ ($i = 1, 2 \ldots k$), the observed frequencies $f_i$ in the various
classes will tend, as the total sample size $N$ ($= \sum f_i$) increases, to a multivariate
normal distribution with means $N\pi_i$ and variance-covariance matrix $V$, where

(13.6.1)  $$V = N\begin{bmatrix} \pi_1(1-\pi_1) & -\pi_1\pi_2 & \cdots & -\pi_1\pi_k \\ -\pi_2\pi_1 & \pi_2(1-\pi_2) & \cdots & -\pi_2\pi_k \\ \vdots & \vdots & & \vdots \\ -\pi_k\pi_1 & -\pi_k\pi_2 & \cdots & \pi_k(1-\pi_k) \end{bmatrix}$$


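Because of the fact that $\sum \pi_i = 1$, we have $d(V) = 0$, which means that the
matrix $V$ is singular and cannot be inverted. However, we may omit one of the
variates $f_i$ (say $f_k$) and express it in terms of the remaining $k - 1$ variates by
the relation $f_k = N - f_1 - f_2 - \ldots - f_{k-1}$. These $k - 1$ frequencies are
multinormally distributed with means $N\pi_i$ and variance-covariance matrix $V^*$,
which is just $V$ with the last row and the last column omitted. The inverse
of $V^*$ is

(13.6.2)  $$(V^*)^{-1} = N^{-1}\begin{bmatrix} \pi_1^{-1}+\pi_k^{-1} & \pi_k^{-1} & \cdots & \pi_k^{-1} \\ \pi_k^{-1} & \pi_2^{-1}+\pi_k^{-1} & \cdots & \pi_k^{-1} \\ \vdots & \vdots & & \vdots \\ \pi_k^{-1} & \pi_k^{-1} & \cdots & \pi_{k-1}^{-1}+\pi_k^{-1} \end{bmatrix}$$

as may be checked by multiplying it with $V^*$, bearing in mind that
$\pi_k = 1 - \sum_{i=1}^{k-1}\pi_i$. The quantity $Q$ of Eq. (13.5.3) which appears in the exponent
of the multivariate normal distribution is, in this case,

(13.6.3)  $$Q = (f_i - N\pi_i)'\,(V^*)^{-1}\,(f_i - N\pi_i)$$

where $(f_i - N\pi_i)'$ is a row vector of $k - 1$ elements and $(f_i - N\pi_i)$ is the
corresponding column vector. On carrying out the multiplication, we find

(13.6.4)  $$\begin{aligned} Q &= \sum_{i=1}^{k-1}\frac{(f_i-N\pi_i)^2}{N\pi_i} + \sum_{i,j=1}^{k-1}\frac{(f_i-N\pi_i)(f_j-N\pi_j)}{N\pi_k} \\ &= \sum_{i=1}^{k-1}\frac{(f_i-N\pi_i)^2}{N\pi_i} + \frac{\left[\sum_{i=1}^{k-1}(f_i-N\pi_i)\right]^2}{N\pi_k} \end{aligned}$$

Since $\sum_{i=1}^{k-1}(f_i - N\pi_i) = -(f_k - N\pi_k)$, the last term is $(f_k - N\pi_k)^2/(N\pi_k)$,
and $Q$ is the familiar sum $\sum_{i=1}^{k}(f_i - N\pi_i)^2/(N\pi_i)$ taken over all $k$ classes.
As shown in Appendix A.17, $Q$ is approximately distributed as $\chi^2$ with $k - 1$
degrees of freedom. This fact is the basis of the ordinary $\chi^2$ test for goodness
of fit, already discussed in § 10.3.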
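
The identity between the quadratic form $Q$ and the ordinary goodness-of-fit sum can be verified numerically. The sketch below takes the inverse of $V^*$ to have elements $N^{-1}(\delta_{ij}/\pi_i + 1/\pi_k)$, which is the form of Eq. (13.6.2) as read here, and compares $Q$ with $\sum (f_i - N\pi_i)^2/(N\pi_i)$ over all $k$ classes. The function names and the sample frequencies are invented for illustration.

```python
def pearson_chi_square(freqs, probs):
    """Goodness-of-fit sum over all k classes."""
    n = sum(freqs)
    return sum((f - n * p) ** 2 / (n * p) for f, p in zip(freqs, probs))


def quadratic_form_q(freqs, probs):
    """Q of Eq. (13.6.3), built from the first k - 1 classes only."""
    n = sum(freqs)
    k = len(freqs)
    # Deviations f_i - N*pi_i for i = 1 ... k-1; the kth class is omitted.
    dev = [freqs[i] - n * probs[i] for i in range(k - 1)]
    q = 0.0
    for i in range(k - 1):
        for j in range(k - 1):
            # Element (i, j) of the inverse of V*, assumed form of Eq. (13.6.2).
            elem = 1.0 / probs[k - 1]
            if i == j:
                elem += 1.0 / probs[i]
            q += dev[i] * (elem / n) * dev[j]
    return q


# Hypothetical sample: 60 items in three classes.
freqs = [18, 22, 20]
probs = [0.3, 0.3, 0.4]
```

For these figures the expected frequencies are 18, 18 and 24, and both functions give $16/18 + 16/24 = 14/9$.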

* 13.7 Hotelling's Generalization of Student's t  Hotelling [7] investigated the
properties of a statistic $T$ which generalizes Student's $t$ for $p$ variates. We recall
that Student's $t$ is the ratio of the difference between the sample mean and the
population mean to the estimated standard deviation of the sample mean, so
that

(13.7.1)  $$t^2 = \frac{N(\bar{x}-\mu)^2}{s^2}$$

Hotelling's $T$ is a standardized measure of the departure of all the $p$ sample
means from their population values. Let

(13.7.2)  $$S_{ij} = S(x_{i\alpha}-\bar{x}_i)(x_{j\alpha}-\bar{x}_j)$$

and let $(S^{ij})$ be the inverse of the matrix $(S_{ij})$. Then $T$ is defined by

(13.7.3)  $$T^2 = N(N-1)\sum_{i,j}S^{ij}(\bar{x}_i-\mu_i)(\bar{x}_j-\mu_j)$$

where $N$ is the sample size. It is easily seen that when $p = 1$, $T^2$ reduces to $t^2$.
For in this case, if $x_1 = x$,

$$S_{11} = S(x_\alpha - \bar{x})^2 = (N-1)s_x^2$$

so that

$$S^{11} = [(N-1)s_x^2]^{-1}$$

and

$$T^2 = \frac{N(\bar{x}-\mu)^2}{s_x^2} = t^2$$


This statistic $T$, given by Eq. (3), may be used, just as $t$ is used, to test the
null hypothesis that $\mu_i = \mu_{i0}$ ($i = 1, 2 \ldots p$) in the population, $\mu_i$ being the true
value of the mean of $X_i$, and $\mu_{i0}$ an assumed value. On the assumption that the
population is multivariate normal, with covariance matrix $(\sigma_{ij})$, the null
hypothesis $H_0$ is rejected when $T^2 > T_0^2$, where

(13.7.4)  $$T^2 = N(N-1)\sum_{i,j}S^{ij}(\bar{x}_i-\mu_{i0})(\bar{x}_j-\mu_{j0})$$

and where $T_0^2$ is chosen so that the probability that $T^2 > T_0^2$, when $H_0$ is
true, is equal to some assigned value $\alpha$.
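
The computation of $T^2$ from raw data may be sketched as follows. Rather than forming the inverse matrix $(S^{ij})$ explicitly, the code solves the linear system $Sw = d$, where $d$ is the vector of differences between the sample means and the hypothetical means, so that $T^2 = N(N-1)\,d'w$. The helper `solve`, the function name `hotelling_t2` and the data layout (a list of $N$ rows of $p$ values) are assumptions made for this illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def hotelling_t2(data, mu0):
    """T^2 of Eq. (13.7.4); data is a list of N rows, each with p values."""
    n, p = len(data), len(mu0)
    means = [sum(row[i] for row in data) / n for i in range(p)]
    # S_ij of Eq. (13.7.2): sums of squares and products about the means.
    s = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data)
          for j in range(p)] for i in range(p)]
    d = [means[i] - mu0[i] for i in range(p)]
    w = solve(s, d)                                    # w = S^{-1} d
    return n * (n - 1) * sum(d[i] * w[i] for i in range(p))
```

For the single variate with values 1, 2, 3, 4, 5 and an assumed mean of 2, the function gives $T^2 = 5 \cdot 4 \cdot (1/10) = 2$, which equals $t^2 = N(\bar{x}-\mu)^2/s^2 = 5/2.5$.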

Hotelling showed that under $H_0$ the quantity $u = T^2/(N-1)$ is a beta-prime
variate with density function

(13.7.5)  $$f(u) = \frac{1}{B[p/2,\ (N-p)/2]}\,\frac{u^{(p-2)/2}}{(1+u)^{N/2}}$$


This is equivalent to the statement that $[T^2(N-p)]/[(N-1)p]$ has the
ordinary F distribution with $p$ and $N - p$ degrees of freedom. When $H_0$ is not
true, and therefore $\mu_i - \mu_{i0}$ is not zero, the distribution is non-central F, with
the same degrees of freedom and with non-centrality parameter
$\lambda = \frac{N}{2}\sum_{i,j}\sigma^{ij}(\mu_i-\mu_{i0})(\mu_j-\mu_{j0})$. The non-central F distribution has the density function

(13.7.6)  $$f(F) = \frac{p}{N-p}\,\frac{e^{-\lambda}}{\Gamma[(N-p)/2]}\sum_{j=0}^{\infty}\frac{\lambda^j}{j!}\,\frac{\Gamma(N/2+j)}{\Gamma(p/2+j)}\left(\frac{pF}{N-p}\right)^{p/2+j-1}\left(1+\frac{pF}{N-p}\right)^{-(N/2+j)}$$


This reduces when $\lambda = 0$ to the ordinary central form as in Eq. (8.15.5). Tables
calculated by Tang (see §§ 9.12 and 12.10) give the probability of accepting the
null hypothesis when it is not true, for various values of $\lambda$ and for significance
levels 0.05 and 0.01. His number of degrees of freedom $f_1$ is our $p$, his $f_2$ is
our $N - p$, and his non-centrality parameter $\phi$ is related to our $\lambda$ by

(13.7.7)  $$(p+1)\phi^2 = 2\lambda$$

Also the variate which Tang denotes by $E^2$ is our $T^2/(T^2 + N - 1)$. This has
the same distribution as the correlation ratio, which is the reason for Tang's
notation. Further details of the $T$ test and its optimum properties may be
found in [8].

13.8 Discriminant Functions  Suppose we wish to assign several individuals,
on the basis of a measured variate $X$, to one or other of two populations A
and B which differ in their means but whose distributions may overlap (Fig. 54).
If the mean of A ($\mu_1$) is greater than the mean of B ($\mu_2$), we would naturally
assign an individual with a high value of $X$ to population A and one with a
low $X$ to population B. If the curves representing the density functions for the
two populations intersect at $X = \alpha$, we might well take $\alpha$ as the dividing point.
There will, of course, be a certain risk of mis-classification. The probability
of classifying an individual who is really an A as belonging to B is $F_A(\alpha)$, where
$F_A(x)$ is the distribution function for population A. Similarly the probability
of classifying a B as an A is $1 - F_B(\alpha)$.
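
The two risks of mis-classification are readily computed when the populations are specified. The sketch below assumes, purely for illustration, that both populations are normal with a common standard deviation, in which case the two density curves intersect midway between the means; the function name and parameters are hypothetical.

```python
from statistics import NormalDist


def misclassification_risks(mu_a: float, mu_b: float, sigma: float):
    """Dividing point and the two error probabilities for normal A and B.

    Assumes a common standard deviation sigma and mu_a > mu_b, so that the
    density curves cross at the midpoint of the two means.
    """
    if mu_a <= mu_b:
        raise ValueError("expected the mean of A to exceed the mean of B")
    pop_a = NormalDist(mu_a, sigma)
    pop_b = NormalDist(mu_b, sigma)
    alpha = (mu_a + mu_b) / 2.0
    a_as_b = pop_a.cdf(alpha)          # an A falls below alpha: F_A(alpha)
    b_as_a = 1.0 - pop_b.cdf(alpha)    # a B falls above alpha: 1 - F_B(alpha)
    return alpha, a_as_b, b_as_a
```

With means 2 and 0 and unit standard deviation the dividing point is 1, and each risk is the normal tail area beyond one standard deviation, about 0.159. With unequal spreads the curves would not cross at the midpoint and the two risks would differ.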


Fig.  54    Discrimination  between  two  populations 


In practice we often have several variates $X_1, X_2 \ldots X_p$ which may be used
for  discrimination,  and  the  problem  then  arises  of  choosing  the  best  function 
of  these  variates  for  discriminating  with  the  least  error  between  populations  A 
and  B.  An  example  is  the  use  of  intelligence,  aptitude  and  achievement  tests  of 
various  kinds,  along  with  high  school  records,  for  attempting  to  assess  whether 
a  student  planning  to  enter  a  university  is,  or  is  not,  likely  to  graduate  in, 
say,  engineering.  A  student  adviser  would  be  glad  to  have  available  a  suitable 
function  of  the  various  test  scores  to  assist  him  in  coming  to  a  decision.  A 
function  of  this  kind  is  called  a  discriminant  function.  Individuals  with  a  value 
of  the  function  greater  than  some  fixed  value  a  will  be  classed  as  A,  those  with 
a  smaller  value  as  B. 

Let us assume that we want a linear function of the measured $x_i$, say

(13.8.1)  $$L = \sum_i b_ix_i, \qquad i = 1, 2 \ldots p$$

and would like to choose the $b_i$ so that $L$ will be as efficient as possible in
discriminating between the two populations. Suppose that of $N$ sample items
available (on each of which $p$ measurements are made and for each of which
the proper classification is known) there are $N_1$ from population A and $N_2$
from population B.

The measurements on variate $X_i$ from population A will be denoted by
$x_{1i\alpha}$, $\alpha = 1, 2 \ldots N_1$, and those from population B by $x_{2i\beta}$, $\beta = 1, 2 \ldots N_2$,
and it is assumed that the two sets $x_{1i}$ and $x_{2i}$ have each a $p$-variate normal
distribution, independent of each other, with means $\mu_{1i}$, $\mu_{2i}$ respectively and
a common covariance matrix $(\sigma_{ij})$. A single later sample item gives the set of
observations $x_i$ ($i = 1, 2 \ldots p$) and it is known that this item belongs to either
A or B but it is not known to which. The discriminant function helps us to
make the decision.

The null hypothesis $H_1$ is that the new item belongs to A; the alternative
hypothesis $H_2$ is that it belongs to B. By the Neyman-Pearson theorem (see
§ 6.9) the most powerful critical region for testing $H_1$ against $H_2$ is given by

(13.8.2)  $$\frac{p_1(x_1, x_2 \ldots x_p)}{p_2(x_1, x_2 \ldots x_p)} < k$$

where $p_1$ and $p_2$ denote the joint probability density functions for $(x_1, x_2 \ldots x_p)$
under $H_1$ and $H_2$ respectively, and where $k$ is a constant determined by the
size of the critical region.

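With the assumptions we have made,

(13.8.3)  $$\begin{aligned} p_1 &= (2\pi)^{-p/2}[d(\sigma_{ij})]^{-1/2}\exp\Big[-\tfrac12\sum_{i,j}\sigma^{ij}(x_i-\mu_{1i})(x_j-\mu_{1j})\Big] \\ p_2 &= (2\pi)^{-p/2}[d(\sigma_{ij})]^{-1/2}\exp\Big[-\tfrac12\sum_{i,j}\sigma^{ij}(x_i-\mu_{2i})(x_j-\mu_{2j})\Big] \end{aligned}$$

where $d(\sigma_{ij})$ is the determinant of the matrix $(\sigma_{ij})$. Then Eq. (2) is equivalent
(on taking logs of both sides) to

(13.8.4)  $$\sum_{i,j}\sigma^{ij}[(x_i-\mu_{2i})(x_j-\mu_{2j}) - (x_i-\mu_{1i})(x_j-\mu_{1j})] < 2\log k$$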

However, the population parameters $\mu_{1i}$, $\mu_{2i}$, $\sigma_{ij}$ are unknown, and we must
replace them by estimators. The optimum estimators of $\mu_{1i}$ and $\mu_{2i}$ are

(13.8.5)  $$\bar{x}_{1i} = N_1^{-1}\,S\,x_{1i\alpha}, \qquad \bar{x}_{2i} = N_2^{-1}\,S\,x_{2i\beta}$$

respectively, while that of $\sigma_{ij}$ is

(13.8.6)  $$s_{ij} = \frac{S_{ij}}{N_1+N_2-2}$$

where

(13.8.7)  $$S_{ij} = S(x_{1i\alpha}-\bar{x}_{1i})(x_{1j\alpha}-\bar{x}_{1j}) + S(x_{2i\beta}-\bar{x}_{2i})(x_{2j\beta}-\bar{x}_{2j})$$

On substituting these estimates in Eq. (4) we obtain the test statistic

(13.8.8)  $$R = \sum_{i,j}s^{ij}[(x_i-\bar{x}_{2i})(x_j-\bar{x}_{2j}) - (x_i-\bar{x}_{1i})(x_j-\bar{x}_{1j})]$$

where $(s^{ij})$ is the inverse of $(s_{ij})$. Now $R$ can be written

$$R = \sum_{i,j}s^{ij}[x_i(\bar{x}_{1j}-\bar{x}_{2j}) + x_j(\bar{x}_{1i}-\bar{x}_{2i}) - \bar{x}_{1i}\bar{x}_{1j} + \bar{x}_{2i}\bar{x}_{2j}]$$

and because $s^{ij} = s^{ji}$,

$$\sum_{i,j}s^{ij}x_i(\bar{x}_{1j}-\bar{x}_{2j}) = \sum_{i,j}s^{ij}x_j(\bar{x}_{1i}-\bar{x}_{2i})$$

Also the last two terms in the square bracket do not depend on $x_i$. The test
$R < c$ is therefore equivalent to

(13.8.9)  $$L = \sum_{i,j}s^{ij}x_i(\bar{x}_{1j}-\bar{x}_{2j}) < k'$$

where $k'$ is a new constant, suitably chosen. Let us denote the difference between
the two sample means for the variate $X_i$ by $d_i$, so that

(13.8.10)  $$d_i = \bar{x}_{1i} - \bar{x}_{2i}$$

Then

$$L = \sum_i b_ix_i$$

with

(13.8.11)  $$b_i = \sum_j s^{ij}d_j$$

This function $L$ is a discriminant function for assigning the new item to either
A or B. If $L < k'$, we shall assign it to B and if $L > k'$ to A. It is convenient
to take $k' = \tfrac12\sum_i b_i(\bar{x}_{1i} + \bar{x}_{2i})$.

The same discriminant function is obtained by using a different approach
(due to Fisher). The constants $b_i$ in the function $L$ are chosen so as to make
the sum of squares between populations for the given sample items as great
as possible, relative to the sum of squares and products within populations. Of
course, it is only the ratios of the $b_i$ that matter, so a constant multiplier makes
no essential difference.

It may be noted that the discriminant function is related to Hotelling's
statistic $T$. We can define a generalized statistic $T_2$ for the two-population
case by

(13.8.12)  $$T_2^2 = \sum_{i,j}s^{ij}[N_1(\bar{x}_{1i}-\bar{x}_i)(\bar{x}_{1j}-\bar{x}_j) + N_2(\bar{x}_{2i}-\bar{x}_i)(\bar{x}_{2j}-\bar{x}_j)]$$

where

(13.8.13)  $$\bar{x}_i = \frac{N_1\bar{x}_{1i} + N_2\bar{x}_{2i}}{N_1+N_2}$$

which is the combined sample mean for $X_i$, and $s_{ij}$ is given by Eqs. (6) and (7)
above. The matrix $(s^{ij})$ is, as usual, the inverse of $(s_{ij})$. The quantity $S_{ij}$ is
the sum of squares and products within populations. By substituting from
Eq. (13) in Eq. (12) we obtain, after a little reduction,

(13.8.15)  $$T_2^2 = \nu\sum_{i,j}s^{ij}d_id_j$$

where $\nu = (N_1N_2)/(N_1+N_2)$ and $d_i$ is given by Eq. (10). It is seen that, apart
from the constant $\nu$, $T_2^2$ is the same as $L$ with $d_i$ in place of $x_i$.

Example 4  In a certain experiment (the details have been modified for
convenience of presentation) 18 rabbits each received a high dose of insulin
and 18 received a low dose. The blood-sugar was measured at 1, 2 and 3 hours
after each dose. The three readings are denoted by $x_1$, $x_2$ and $x_3$. The aim is
to find what linear combination of these readings would be expected to
discriminate most effectively between a high dose and a low dose of insulin in
another rabbit.

The S.S. and S.P. within populations were as shown in the following table:

              S.S.                          S.P.
    x1^2     x2^2     x3^2       x1x2     x1x3     x2x3
    2677     2358     3223       1278     1814     1966

and the values of $d_i$ (mean low-insulin value − mean high-insulin value) were
7.594, 19.73 and 25.04, respectively. The matrix $(S_{ij})$ is

$$\begin{bmatrix} 2677 & 1278 & 1814 \\ 1278 & 2358 & 1966 \\ 1814 & 1966 & 3223 \end{bmatrix}$$

and its inverse $(S^{ij})$ is proportional to

$$\begin{bmatrix} 3.735 & -0.553 & -1.765 \\ -0.553 & 5.337 & -2.945 \\ -1.765 & -2.945 & 4.679 \end{bmatrix}$$

The values of $b_1$, $b_2$, $b_3$ are therefore proportional to $d_1S^{11} + d_2S^{21} + d_3S^{31}$,
$d_1S^{12} + d_2S^{22} + d_3S^{32}$, and $d_1S^{13} + d_2S^{23} + d_3S^{33}$ respectively, that is,
to −26.7, 27.4 and 45.6. A good approximation to the best discriminant
would accordingly be

$$L = -3x_1 + 3x_2 + 5x_3$$

since 26.7, 27.4 and 45.6 are approximately in the ratio 3:3:5.
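
The arithmetic of Example 4 can be reproduced as follows. Since only the ratios of the $b_i$ matter, the sketch works with the matrix of cofactors (the adjugate), which is proportional to the inverse, and scales it by $10^{-6}$ to match the figures quoted in the text. The helper names and the scale factor are choices made for this illustration.

```python
# Within-population sums of squares and products, and the mean differences d_i,
# as given in Example 4.
S = [[2677.0, 1278.0, 1814.0],
     [1278.0, 2358.0, 1966.0],
     [1814.0, 1966.0, 3223.0]]
D = [7.594, 19.73, 25.04]


def adjugate_3x3(m):
    """Transposed matrix of cofactors of a 3 x 3 matrix (inverse times determinant)."""
    def minor(i, j):
        rows = [r for r in range(3) if r != i]
        cols = [c for c in range(3) if c != j]
        return (m[rows[0]][cols[0]] * m[rows[1]][cols[1]]
                - m[rows[0]][cols[1]] * m[rows[1]][cols[0]])
    # Element (i, j) of the adjugate is the cofactor of element (j, i).
    return [[(-1) ** (i + j) * minor(j, i) for j in range(3)] for i in range(3)]


def discriminant_weights(s, d, scale=1e-6):
    """Values proportional to b_i = sum_j S^{ij} d_j."""
    adj = adjugate_3x3(s)
    return [scale * sum(adj[i][j] * d[j] for j in range(3)) for i in range(3)]


weights = discriminant_weights(S, D)
```

The three weights come out close to −26.7, 27.4 and 45.7; the last differs from the quoted 45.6 only through rounding of the inverse to three decimals, and the ratio −3 : 3 : 5 is unaffected.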

Example 5  (Nature, 168, 1951, p. 794). In order to discriminate between
fossil skulls of men and chimpanzees, a discriminant function was calculated
using four distinct measurements on the lower milk canine teeth. On the basis
of 44 chimpanzee and 40 human teeth, the function obtained was
$L = x_1 - \text{7A9}\,x_2 + 2.34x_3 + 4.70x_4$. The average value of $L$ turned out to
be +17.6 for the chimpanzee teeth and −5.0 for the human, with a standard
deviation of 2.45. It was therefore concluded that if a tooth of unknown origin
gave an $L$ between 11.5 and 23.7 it might very reasonably be classed as a
chimpanzee's, and if $L$ lay between 1.1 and −11.1 as human. For the famous Taungs
skull, of great antiquity, $L$ turned out to be −7.9 and for the Kromdraai skull
−2.6, so that both these are probably human.
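
The classification rule of Example 5 amounts to two intervals, each extending 6.1 units (about 2.5 standard deviations) on either side of a group mean. A minimal sketch of the rule is given below; the function name and the label "undetermined" for values falling in neither interval are assumptions added for illustration.

```python
# Intervals for the discriminant L quoted in Example 5.
CHIMP_RANGE = (11.5, 23.7)    # 17.6 plus or minus 6.1
HUMAN_RANGE = (-11.1, 1.1)    # -5.0 plus or minus 6.1
STANDARD_DEVIATION = 2.45


def classify_tooth(l_value: float) -> str:
    """Assign a tooth to a group from its discriminant value L."""
    if CHIMP_RANGE[0] <= l_value <= CHIMP_RANGE[1]:
        return "chimpanzee"
    if HUMAN_RANGE[0] <= l_value <= HUMAN_RANGE[1]:
        return "human"
    return "undetermined"


taungs = classify_tooth(-7.9)
kromdraai = classify_tooth(-2.6)
```

Both fossil values fall inside the human interval, while a value such as 6.0, lying between the two intervals, would be left unassigned by this rule.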

* 13.9 The Distance Between Two Populations  The discriminant function and
Hotelling's generalized $T^2$ are both closely related to a measure of "generalized
distance" between two populations, proposed by Mahalanobis.

Suppose $p$ variates $X_i$ are measured on each sample item from each population.
Let the population means of $X_i$ be $\mu_{1i}$ and $\mu_{2i}$, and let

(13.9.1)  $$x_{1i} = \mu_{1i} + \varepsilon_i$$

where the $\varepsilon_i$ have a multivariate normal distribution (the same for both
populations) with means 0 and covariance matrix $(\sigma_{ij})$. If

(13.9.2)  $$\delta_i = \mu_{1i} - \mu_{2i}$$

the generalized distance is given by

(13.9.3)  $$\Delta^2 = \sum_{i,j}\sigma^{ij}\delta_i\delta_j$$

where $(\sigma^{ij})$ is the inverse of $(\sigma_{ij})$. A factor $p^{-1}$ is sometimes included on the
right-hand side of Eq. (3) but this makes no essential difference.

Since in practice we have to estimate the population means from the sample
means, the formula for the observed distance becomes

(13.9.4)  $$D^2 = \sum_{i,j}\sigma^{ij}d_id_j$$

where

(13.9.5)  $$d_i = \bar{x}_{1i} - \bar{x}_{2i}$$

The first mean is calculated from a sample of size $N_1$, say, and the second
from a sample of size $N_2$. If $\sigma_{ij}$ is not known, the estimator $s_{ij}$, defined as in
Eqs. (13.8.6) and (13.8.7), must be used. In this case, we speak of the studentized
distance, $D_s$, which is given by

(13.9.6)  $$D_s^2 = \sum_{i,j}s^{ij}d_id_j$$

and so is identical, apart from the constant $\nu$, with Hotelling's generalized $T_2^2$
in Eq. (13.8.15).

If the true Mahalanobis distance $\Delta$ is zero (so that the two populations are
really identical), the quantity $\nu D^2$ has an ordinary $\chi^2$ distribution with $p$ d.f.
If $\Delta$ is not zero, $\nu D^2$ is distributed like non-central $\chi^2$ with parameter of
non-centrality $\nu\Delta^2/2$ (see Appendix A.13).

For the studentized distance, it was shown by Bose and Roy that if $\Delta = 0$,
the quantity

(13.9.7)  $$F = \frac{N_1+N_2-p-1}{p}\cdot\frac{\nu D_s^2}{N_1+N_2-2}$$

has the ordinary F distribution with $p$ and $N_1 + N_2 - p - 1$ d.f. If $\Delta$ is not
zero, it has the non-central F distribution (see (13.7.6)) with non-centrality
parameter $\nu\Delta^2/2$.

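For the bivariate case, if we denote $x_1$ by $x$ and $x_2$ by $y$, and if the correlation
between $x$ and $y$ in the population is measured by $\rho$, we have

$$(\sigma_{ij}) = \begin{bmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{bmatrix}, \qquad (\sigma^{ij}) = [\sigma_x^2\sigma_y^2(1-\rho^2)]^{-1}\begin{bmatrix} \sigma_y^2 & -\rho\sigma_x\sigma_y \\ -\rho\sigma_x\sigma_y & \sigma_x^2 \end{bmatrix}$$

so that $D^2$ becomes

(13.9.8)  $$D^2 = (1-\rho^2)^{-1}\left[\frac{(\bar{x}_1-\bar{x}_2)^2}{\sigma_x^2} + \frac{(\bar{y}_1-\bar{y}_2)^2}{\sigma_y^2} - \frac{2\rho(\bar{x}_1-\bar{x}_2)(\bar{y}_1-\bar{y}_2)}{\sigma_x\sigma_y}\right]$$

which, when $\rho = 0$, reduces to the square of the geometrical distance between
the two sample means in the x-y plane (if standardized units for $x$ and $y$ are
used).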
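
The bivariate distance is simple enough to compute in two ways, which gives a check on the algebra. The sketch below evaluates $D^2$ once from the closed form, $(1-\rho^2)^{-1}[d_x^2/\sigma_x^2 + d_y^2/\sigma_y^2 - 2\rho d_xd_y/(\sigma_x\sigma_y)]$, and once from the quadratic form $\sum \sigma^{ij}d_id_j$ with the inverse covariance matrix written out. Here $d_x$ and $d_y$ stand for the differences of the sample means; the function names are illustrative.

```python
def mahalanobis_d2_bivariate(dx, dy, sx, sy, rho):
    """D^2 from the closed bivariate form, Eq. (13.9.8)."""
    zx, zy = dx / sx, dy / sy          # differences in standardized units
    return (zx * zx + zy * zy - 2.0 * rho * zx * zy) / (1.0 - rho * rho)


def mahalanobis_d2_matrix(dx, dy, sx, sy, rho):
    """The same distance from the quadratic form with the inverse covariance matrix."""
    det = sx * sx * sy * sy * (1.0 - rho * rho)
    inv = [[sy * sy / det, -rho * sx * sy / det],
           [-rho * sx * sy / det, sx * sx / det]]
    d = [dx, dy]
    return sum(inv[i][j] * d[i] * d[j] for i in range(2) for j in range(2))
```

With $\rho = 0$ and unit standard deviations, mean differences of 3 and 4 give $D^2 = 25$, the square of the ordinary distance 5.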


13.10  Stochastic  Processes  An  important  part  of  modern  statistical  theory 
is  concerned  with  events  which  change  in  a  random  way  as  time  goes  on. 
An  example  is  the  position  of  a  microscopic  inert  particle  suspended  in  a  fluid, 
and  exhibiting  the  so-called  Brownian  motion.  Another  is  the  size  of  a 
population  affected  by  births,  deaths,  immigration,  emigration,  etc.  These 
and  many  other  such  linked  chains  of  events,  proceeding  in  time  and  subject  to 
random  fluctuations,  are  called  stochastic  processes. 

The mathematical model of a stochastic process is a variate $X(t)$ depending
on a parameter $t$ which is usually (in physical applications) the time. Since $t$ is
continuous, the set of possible values of $X(t)$ is non-countable, but in practice
observations are taken at a finite set of values $t_r$ ($r = 1, 2 \ldots n$). The
corresponding values of $X$, say $x_1, x_2 \ldots x_n$, are the components of an $n$-dimensional
vector, having a distribution function $F(x_1 \ldots x_n;\ t_1 \ldots t_n)$, so that $X$ is a
multivariate random, or stochastic, variable. One of the simplest examples of a
stochastic process is the "random walk" mentioned in § 7.6. The random step
$X_r$ taken at time $t_r$ in either the positive or negative direction of the x-axis is
independent of all the previous $X_i$ at times $t_1, t_2 \ldots t_{r-1}$. The cumulative sum

(13.10.1)  $$S_r = X_1 + X_2 + \ldots + X_r$$

is a stochastic process, representing the position of the individual taking the
"walk" at time $t_r$. Thus if A plays a set of gambling games with B, receiving one
dollar from B each time he wins and paying one dollar to B each time he loses,
and if the result of any game is independent of the preceding games, the total
sum held by A at the end of $r$ games, supposing he started off with a fixed sum
$m$ dollars, is a random walk process. Here $X_r$ is always either +1 or −1, and
sooner or later either A or his opponent (if similarly situated) will be ruined.
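
A short simulation makes the gambling illustration concrete. The sketch below follows A's holdings game by game until one player is ruined. The combined capital `total`, the win probability `p_win`, the step limit and the function name are assumptions introduced for the illustration; the text fixes only the starting sum $m$ and the unit stakes.

```python
import random


def gamblers_walk(start: int, total: int, p_win: float = 0.5,
                  rng: random.Random = None, max_steps: int = 1_000_000):
    """Simulate A's holdings until A or B is ruined.

    start : A's initial capital m
    total : combined capital of A and B (hypothetical parameter)
    Returns the successive holdings of A, starting with the initial sum.
    """
    rng = rng or random.Random()
    path = [start]
    while 0 < path[-1] < total and len(path) <= max_steps:
        step = 1 if rng.random() < p_win else -1   # X_r is +1 or -1
        path.append(path[-1] + step)
    return path


walk = gamblers_walk(3, 10, rng=random.Random(1))
```

The last element of `walk` is 0 if A is ruined and `total` if B is ruined. Repeating the simulation with different seeds and counting the outcomes gives an empirical estimate of the chance of ruin.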

Sequential  binomial  sampling  is  another,  and  more  respectable,  example,  in 
which  each  step  represents  the  result  of  sampling  one  more  item  from  the 
population  (usually  referred  to  as  a  "lot").  The  step  is  in  one  direction  or 
another  according  as  the  result  of  the  inspection  is  a  "success"  or  a  "failure." 
The  total  number  of  steps  taken  is  a  stochastic  variable  representing  the  total 
number  of  sample  items  required  to  arrive  at  one  or  other  of  two  possible 
decisions  regarding  the  population  sampled  (for  instance,  to  accept  the  whole 
lot  or  to  reject  the  whole  lot). 

13.11 Markov Processes  A stochastic process is said to be of the Markov
type if the value of $X$ at any time $t_r$ depends at most on the value at the
immediately preceding available time $t_{r-1}$. No earlier history of the process
adds anything further to the probable future history of $X$. The joint probability
for the observed set of values $x_1, x_2 \ldots x_n$ is therefore given by

(13.11.1)  $$P(x_1, x_2 \ldots x_n) = P(x_1)\cdot P(x_2|x_1)\cdot P(x_3|x_2) \ldots P(x_n|x_{n-1}) = P(x_1)\prod_{r=2}^{n}P(x_r|x_{r-1})$$

A Markov process is defined by the initial probability distribution $P(x_1)$ and
the conditional distribution $P(x_r|x_{r-1})$ for any arbitrary choice of the times $t_r$.
It follows that

(13.11.2)  $$P(x_r|x_{r-2}) = \int_{(x_{r-1})}P(x_r|x_{r-1})\cdot P(x_{r-1}|x_{r-2})$$

where the integral stands for a sum over the possible values of $x_{r-1}$ (if $X$ is
discrete) or the ordinary integral over the whole domain of $x_{r-1}$ (if $X$ is
continuous). This relation is known as the Chapman-Kolmogorov equation.
By a repeated application of the equation we may obtain $P(x_r|x_1)$ and then

(13.11.3)  $$P(x_r) = \int_{(x_1)}P(x_r|x_1)\,P(x_1)$$

When the variate $X$ is discrete (as in many applications), the process is called
a Markov chain, and a matrix notation is convenient. We suppose that at each
time $t_r$, $X$ can take one of a finite number $n$ of possible values $x_1, x_2 \ldots x_n$,
representing $n$ possible states of the system. The probability that $X$ changes
from $x_i$ to $x_j$ between $t_{r-1}$ and $t_r$ will be denoted by $p_{ij}$ (called a transition
probability). If we know the matrix $P$ of transition probabilities, and the initial
value of $X$, we can calculate the probability for any possible value at any
future time.

Note that in this matrix

$$P = \begin{bmatrix} p_{11} & \cdots & p_{1n} \\ \vdots & & \vdots \\ p_{n1} & \cdots & p_{nn} \end{bmatrix}$$

the probabilities in each row (but not necessarily in each column) add up to 1.
This is because in whatever state the system may be at time $t_{r-1}$ it must be in
one and only one of the $n$ possible states at time $t_r$. However, since the transition
probabilities refer only to transitions from $x_i$ to $x_j$, a similar argument does
not apply to the columns.

For a Markov chain, Eqs. (2) and (3) become

(13.11.4)  $P(t_r \mid t_{r-2}) = P^2$,  $P(t_r) = p'P^{r-1}$

where $P(t_r)$ denotes the probability of the various states at time $t_r$ and p' is a
row vector of the probabilities at $t_1$ for the same n states.

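Example 6  A system can be in one of two possible states. Initially the
chance is the same for each, and at each transition the probability matrix is

$\begin{bmatrix} 1/3 & 2/3 \\ 1/2 & 1/2 \end{bmatrix}$

What are the probabilities for the two states after three steps?
We have $p' = [\tfrac12, \tfrac12]$,

$P = \begin{bmatrix} 1/3 & 2/3 \\ 1/2 & 1/2 \end{bmatrix}$,  $P^2 = \begin{bmatrix} 4/9 & 5/9 \\ 5/12 & 7/12 \end{bmatrix}$,  $P^3 = \begin{bmatrix} 23/54 & 31/54 \\ 31/72 & 41/72 \end{bmatrix}$

$p'P^3 = \left[\tfrac{185}{432}, \tfrac{247}{432}\right]$

The required probabilities are therefore 185/432 and 247/432.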

If written as decimals, the rows of $P^3$ are [0.426, 0.574] and [0.431, 0.569],
which are nearly the same. A basic theorem on Markov chains states that if
the matrix P is regular (that is, if some power of P has no zero elements) then,
as n increases, $P^n$ tends to a matrix Q, each row of which consists of the same
probability vector q', with no zero element. Furthermore, whatever the initial
probability vector p', $p'P^n \to q'$ and q' is a unique fixed probability vector
satisfying the relation

(13.11.5)  $q'P = q'$

In the example above, the vector q' is [3/7, 4/7]. The interpretation of this
theorem is that after a great many steps, the probability that the system is in a
state $x_j$ is very nearly equal to the jth element of q' regardless of the initial
probabilities of the various states.

Another  theorem  states  that  in  a  regular  Markov  chain,  with  transitions  at 
unit  time  intervals,  the  average  time  it  takes  to  return  to  a  given  state,  having 
once  been  there,  is  the  reciprocal  of  the  limiting  probability  of  being  in  that  state. 
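
The matrix calculations of this section can be checked with exact fractions. The following is only a sketch: it assumes that the transition matrix of Example 6 has rows (1/3, 2/3) and (1/2, 1/2), the values consistent with the $P^3$ quoted there, and the helper names `mat_mul` and `mat_pow` are illustrative rather than part of the text.

```python
from fractions import Fraction as F

def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_pow(p, n):
    """n-th power of a square matrix (n >= 1) by repeated multiplication."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result

# Assumed transition matrix and initial row vector p' of Example 6.
P = [[F(1, 3), F(2, 3)],
     [F(1, 2), F(1, 2)]]
p0 = [[F(1, 2), F(1, 2)]]

P3 = mat_pow(P, 3)
after_three = mat_mul(p0, P3)[0]      # p' P^3, as in Eq. (13.11.4)

# For a two-state chain, q'P = q' with q1 + q2 = 1 gives q1 = p21 / (p12 + p21).
q1 = P[1][0] / (P[0][1] + P[1][0])
q = [q1, 1 - q1]

# Mean return times are the reciprocals of the limiting probabilities.
mean_return = [1 / x for x in q]

print(after_three, q, mean_return)
```

Exact fractions are used instead of floating point so that the results can be compared directly with the fractions quoted in the example.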

13.12 Stationary Processes  A stationary process is one whose distribution
function $F_n(x_1, x_2, \ldots, x_n \mid t_1, t_2, \ldots, t_n)$, for any set of times $t_1, t_2, \ldots, t_n$, depends
only on the n - 1 intervals $t_r - t_1$ (r = 2, 3, ..., n) and so is independent of
the absolute times. Translation along the time axis makes no difference to the
probabilities in a stationary process. An example is a quality control chart
where the variate concerned is satisfactorily "in control." Also a Markov chain,
after a considerable number of steps, has become practically stationary.
Stationary processes represent a kind of stochastic equilibrium and are therefore
of importance in many practical situations, such as arise in problems of
communication engineering.

A  process  in  which  the  mean  is  constant  but  the  variance  may  change  is  said 
to  be  stationary  to  the  first  order. 

If the mean and variance are constant, as well as the covariance between
values at a fixed interval apart, the process is stationary to the second order,
and so on.

Let us consider the case when all the observations are taken at regularly
spaced times, so that $t_r - t_{r-1} = T$ for all r. For a second-order stationary
process

(13.12.1)  $E(X_r) = \mu$
           $E[(X_r - \mu)(X_{r-1} - \mu)] = \sigma^2 \rho(T)$

where $\sigma^2$ is the variance of any $X_r$ and $\rho(T)$ is a function of the fixed interval T.
This function is called the autocorrelation coefficient, and is the correlation
coefficient for pairs of values of X separated by the interval T. Since a multivariate
normal distribution depends only on the first and second moments, it
follows that a normal process which is stationary to the second order is also
completely stationary.

The simplest case of a stationary process is a sequence of independent observations
such as heads or tails in repeated tosses of a coin. Here $\rho(T) = 0$ for
every non-zero interval T. Another example is a linear Markov process, defined
by

(13.12.2)  $X_{r+s} = \lambda_s X_r + Y_{r+s}$

where $Y_{r+s}$ is a sequence of uncorrelated variates independent of $X_r$, for all s > 0.
Since $E(X_r) = \mu$ for all r, we have from Eq. (2),

$\mu = \lambda_s \mu + \mu_{y,s}$

where $\mu_{y,s}$ is the expectation of $Y_{r+s}$. Then

(13.12.3)  $\mu_{y,s} = \mu(1 - \lambda_s)$

If we multiply Eq. (2) by $X_r$ and then take expectations, we get, on putting $\sigma^2$
for the variance of $X_r$, and $\rho_s$ for the correlation coefficient between $X_r$ and $X_{r+s}$,

$\sigma^2 \rho_s + \mu^2 = \lambda_s(\sigma^2 + \mu^2) + \mu\,\mu_{y,s}$

This gives, after substituting from Eq. (3),

$\sigma^2 \rho_s = \lambda_s \sigma^2$

or

(13.12.4)  $\lambda_s = \rho_s$

Since the process is Markov, the partial correlation between, say, $X_1$ and $X_3$
(eliminating $X_2$) is zero. By Eq. (13.4.7) this implies that $\rho_{13} - \rho_{12}\rho_{23} = 0$, where
$\rho_{13}$ is the ordinary correlation coefficient for $X_1$ and $X_3$. Since $\rho_{12}$ and $\rho_{23}$
are both equal to $\rho_1$ (s being 1 in both cases), it follows that $\rho_{13} = \rho_1^2$, and,
in general,

(13.12.5)  $\rho_s = \rho_1^s$

This  relationship  among  the  correlation  coefficients  is  characteristic  of  the 
stationary  linear  Markov  process. 
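
Relation (13.12.5) can be illustrated by simulation. The sketch below is not from the text: it assumes unit time steps, zero mean, standard normal shocks for the $Y$ variates, and an arbitrary coefficient $\lambda_1 = 0.6$; the function names are illustrative.

```python
import random

def simulate_linear_markov(lam, n, seed=0):
    """Simulate X_{r+1} = lam * X_r + Y_{r+1} with independent N(0, 1) shocks Y."""
    rng = random.Random(seed)
    x = 0.0
    out = []
    for _ in range(n + 1000):          # the first 1000 steps are a discarded burn-in
        x = lam * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out[1000:]

def autocorr(xs, s):
    """Sample correlation coefficient between X_r and X_{r+s}."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((v - mean) ** 2 for v in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + s] - mean) for i in range(n - s)) / (n - s)
    return cov / var

xs = simulate_linear_markov(lam=0.6, n=100_000, seed=1)
rho = [autocorr(xs, s) for s in range(1, 5)]

# Compare each sample rho_s with the s-th power of the sample rho_1.
for s, r in enumerate(rho, start=1):
    print(s, round(r, 3), round(rho[0] ** s, 3))
```

With a long enough series the two printed columns should agree to within sampling error, and the first sample coefficient should be close to the assumed $\lambda_1$, as Eq. (13.12.4) requires.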

We cannot enter further into the topic of stochastic processes, with its many
applications to economic time-series, population and genetic problems, communication
theory and traffic engineering, cosmic ray showers and thermodynamics,
to mention only a few of the directions in which the theory has been
applied. Those who wish some further insight may consult reference [9] which
contains a fairly full bibliography. See also [10].


PROBLEMS 
A.  (§§13.1-13.4) 

1.  Given  that,  for  a  group  of  children  between  the  ages  of  8  and  14,  the  ordinary 
coefficients  of  correlation  between  intelligence  and  school  achievement,  between 
intelligence  and  age,  and  between  school  achievement  and  age,  are  0.80,  0.70  and  0.60, 
respectively,  calculate  the  correlation  coefficient  between  intelligence  and  school 
achievement,  eliminating  the  effect  of  age. 

2. In Problem A-3 of Chapter 12, calculate the three partial correlation coefficients
and also the multiple correlation coefficient of Y on X and Z. Is this multiple correlation
significantly different from zero?

3. In calculating correlation coefficients between three variables, a student obtained
the values $r_{01} = 0.6$, $r_{02} = -0.4$, $r_{12} = 0.7$. Is there good reason to suspect these
values? Why? Hint: Calculate $r_{0.12}$ from the data.

4. In Example 3 of § 13.4, calculate the partial coefficients of correlation between
$X_0$ and $X_2$ and between $X_1$ and $X_2$.

5. From the data of Problem A-5 of Chapter 12, calculate all the ordinary coefficients
of correlation, the partial coefficients of correlation between Y and each of
$X_1$, $X_2$, and $X_3$, and the multiple correlation of Y on $X_1$, $X_2$ and $X_3$.

6. (Pearl and Surface) In a biometric study of egg production in the domestic fowl,
measurements of length, breadth and weight ($X_0$, $X_1$, $X_2$, respectively) were made on
453 eggs. From all these the value of $r_{01.2}$ was -0.8955. If the 42 eggs weighing from
53 to 53.9 gm are considered alone, the ordinary coefficient of correlation $r_{01}$ between
length and breadth is -0.9117; similarly for the 46 eggs between 56 and 56.9 gm,
$r_{01} = -0.8911$, and for the 13 eggs between 62 and 62.9 gm, $r_{01} = -0.8739$. Show
that the weighted mean of these values of $r_{01}$ is nearly equal to $r_{01.2}$ (compare § 13.4,
immediately before Example 3).

7. If all the ordinary correlation coefficients in a set of p + 1 variates $X_0, X_1, \ldots, X_p$
are equal to r, show that the partial correlation coefficients $r_{01.2\ldots p}$, $r_{02.1\ldots p}$, etc. are
each equal to $r/[1 + (p - 1)r]$ and that the multiple correlation of $X_0$ on $X_1, X_2, \ldots, X_p$ is
given by

$1 - r^2_{0.12\ldots p} = \frac{(1 - r)(1 + pr)}{1 + (p - 1)r}$

Hint: Show that the determinant d(R) is equal to $(1 - r)^p(1 + pr)$ and that
$R_{00} = (1 - r)^{p-1}[1 + (p - 1)r]$, $R_{01} = -r(1 - r)^{p-1}$, etc.

8. With three variates $X_0$, $X_1$ and $X_2$, show that the correlation coefficient between
the residuals $x_{0.12}$ and $x_{1.20}$ is equal and opposite to that between $x_{0.2}$ and $x_{1.2}$. Hint:
$x_{0.12}$ is the same as the va of § 13.2, and $x_{1.20}$ is the residual for the multiple regression
of $X_1$ on $X_2$ and $X_0$.

B.  (§§  13.5-13.7) 

1. Write out the joint density function for the trivariate normal distribution, taking
$x_0 = x$, $x_1 = y$, $x_2 = z$ and putting $\rho_{01} = \rho_{xy}$, etc. The variables may be supposed all
standardized.

2. Show that, if X and Y are independent normal variates with zero means and
variances $\sigma_X^2$, $\sigma_Y^2$ respectively, the bivariate normal surface is cut by a plane through
the z-axis in a curve for which the points of inflexion lie on the elliptic cylinder
$x^2/\sigma_X^2 + y^2/\sigma_Y^2 = 1$. Hint: The equation of the plane is y = mx. The points of inflexion are
given by $d^2z/dx^2 = 0$, where z = f(x, y).

3. If the variates $X_1, X_2, \ldots, X_N$ ($-\infty < X_i < \infty$) are independent and have a
joint density function which is a function of $x_1^2 + x_2^2 + \cdots + x_N^2$ only, show that the
$X_i$ must be normal with mean zero and common variance. Hint: The functional
equation f(x) f(y) = f(x + y) has the solution $f(x) = e^{cx}$.

4. Let $X_1$, $X_2$ have a joint bivariate normal distribution with means zero and
covariance matrix $C = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}$. Write out the density function for $X_2$, given $X_1$.
Hint: $f(x_2 \mid x_1) = f(x_1, x_2)/f(x_1)$. When the variates are not standardized, the matrix
$A^{-1}$ of Eq. (13.5.4) is replaced by the covariance matrix.

5. If $X_1$, $X_2$, $X_3$ have a joint trivariate normal distribution, with covariance matrix

$K = \begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{bmatrix}$

calculate the expectation of $X_1$, given $X_2$ and $X_3$.

6. In the following table from Student's 1908 paper [11], $x_1$ and $x_2$ represent
additional hours of sleep obtained by the use of soporific drugs A and B respectively
on certain patients.

Patient     x1      x2
   1        1.9     0.7
   2        0.8    -1.6
   3        1.1    -0.2
   4        0.1    -1.2
   5       -0.1    -0.1
   6        4.4     3.4
   7        5.5     3.7
   8        1.6     0.8
   9        4.6     0.0
  10        3.4     2.0

Assuming that each pair of observations of $x_1$ and $x_2$ for a given patient is from a
bivariate normal distribution, use the Hotelling $T^2$ test to test the hypothesis at significance
level 0.01 that neither drug really produces any soporific effect. (The following
results of computation may be used:

$\bar{x}_1 = 2.33$,  $\bar{x}_2 = 0.75$,  $S_{11} = 36.08$,  $S_{22} = 28.80$,  $S_{12} = 25.63$

The null hypothesis is that $\mu_{10} = \mu_{20} = 0$.)

C.  (§§  13.8-13.9) 

1. Two treatments were applied to experimental forage plots, in 15 randomized
blocks, each consisting of two plots, so that both treatments were used once in each
block. The variable was the amount of Dutch clover in the forage stand, and this
was estimated by two methods: (1) using a mechanical counter and (2) by eye. The
means for the differences of the two treatments by the two methods were $\bar{d}_1 = 1.34$,
$\bar{d}_2 = 1.06$, and the sums of squares and products within the two treatments were $S_{11} =
20.44$, $S_{22} = 6.41$, $S_{12} = 4.89$. Calculate the best discriminant function of the form
$b_1x_1 + b_2x_2$ for distinguishing between the two treatments.

2. (Johnson [12]) Two populations of students taking university physics are
distinguished as (1) those taking the standard elementary course and (2) those taking
a somewhat more advanced course intended for better-prepared students. Discrimination
is made on the basis of three measurements, $X_1$ (a mathematics test score), $X_2$
(the A.C.E. test score) and $X_3$ (the student's honor-point ratio). For 111 students in
the first course and 257 students in the second, the following results were recorded:

                                  Course (1)    Course (2)
N                                    111           257
Mean of x1                          87.640        92.397
Mean of x2                          31.081        56.074
Mean of x3                          1.1586        1.2689
S(x1 - mean x1)^2                   53136         194356
S(x2 - mean x2)^2                   11616         15864
S(x3 - mean x3)^2                   51.85         120.39
S(x1 - mean x1)(x2 - mean x2)       4863          17878
S(x1 - mean x1)(x3 - mean x3)       485.5         1844.0
S(x2 - mean x2)(x3 - mean x3)       243.8         836.6

Calculate the best discriminant function for distinguishing between the courses. If
a new student comes along with the scores $x_1 = 80$, $x_2 = 40$, $x_3 = 1.5$, to which
course should he be assigned? Hint: Calculate $\bar{L}_1 = \sum_i b_i \bar{x}_i$ for course (1), and
$\bar{L}_2 = \sum_i b_i \bar{x}_i$ for course (2). If the observed $L < \tfrac12(\bar{L}_1 + \bar{L}_2)$, the student should be
assigned to course (1).

3. R. A. Fisher [13] has discussed the separation of two species of iris, namely, (1)
versicolor and (2) setosa. The criteria are $X_1$ (sepal length), $X_2$ (sepal width), $X_3$ (petal
length) and $X_4$ (petal width), all in centimeters. The data on 50 specimens of (1) and
50 specimens of (2) are summarized as follows:

$(\bar{x}_i^{(1)}) = (5.936,\ 2.770,\ 4.260,\ 1.326)$

$(\bar{x}_i^{(2)}) = (5.006,\ 3.428,\ 1.462,\ 0.246)$

$98\,(s_{ij}) = \begin{bmatrix} 19.1434 & 9.0356 & 9.7634 & 3.2394 \\ 9.0356 & 11.8658 & 4.6232 & 2.4746 \\ 9.7634 & 4.6232 & 12.2978 & 3.8794 \\ 3.2394 & 2.4746 & 3.8794 & 2.4604 \end{bmatrix}$

The inverse matrix is

$(s^{ij}) = 98 \begin{bmatrix} 0.11872 & -0.06687 & -0.08162 & 0.03964 \\ -0.06687 & 0.14527 & 0.03341 & -0.11075 \\ -0.08162 & 0.03341 & 0.21936 & -0.27202 \\ 0.03964 & -0.11075 & -0.27202 & 0.89455 \end{bmatrix}$

Calculate the discriminant function, and state the criterion for allotting another specimen
of iris to the one species or the other.

4. Calculate Hotelling's $T_2^2$ for the data of Problem C-3, and test its significance
by the F test. Hint: On the null hypothesis that the two populations have the same
vectors of means, $(\mu_i^{(1)})$ and $(\mu_i^{(2)})$, the quantity $T_2^2 (N_1 + N_2 - p - 1)/[p(N_1 +
N_2 - 2)]$ has the F distribution with p and $N_1 + N_2 - p - 1$ d.f.

D.  (§§  13.10-13.12) 

1. In Example 6 of § 13.11, calculate $P^4$, and hence obtain the probabilities for the
two possible states of the system after four transitions.

2.  If  the  transition  probabilities  for  a  Markov  chain  are  given  by 


0  1 

1  0 
0  0 
0  0 


and  the  initial  state  has  probabilities  (£,  £,  ^,  ^)  what  are  the  probabilities  after  one, 
two  and  three  transitions?  Why  is  there  no  limiting  probability  distribution  in  this 
example? 

3.  Compute  the  fixed  probability  vector  for  the  following  matrices : 


$\begin{bmatrix} 0.1 & 0.9 \\ 0.6 & 0.4 \end{bmatrix}$,


I 


4. Two urns, labelled (1) and (2), each contain n balls, either white or black.
There are n black balls and n white balls altogether, but the compositions of the urns
may differ. A transition consists in choosing a ball at random from each urn and
interchanging them, putting the ball from (1) into (2) and the ball from (2) into (1).
Each state is completely specified by the number of black balls in urn number (1).
If at any state of the process there are j black balls in urn (1), what are the transition
probabilities?

If initially j = 0 and n = 4, what are the probable compositions of the urns after
three transitions?

Show that the vector of probabilities $q' = (p_0, p_1, p_2, \ldots, p_n)$, where
$p_j = \binom{n}{j}^2 \Big/ \binom{2n}{n}$,
is the fixed vector for this transition matrix.

5.  A  Markov  chain  is  said  to  be  ergodic  if  it  is  possible  to  go  from  every  state  to 
every  other  state.  A  regular  chain  is  necessarily  ergodic,  since  if  the  nth  power  of  P 
contains  no  zeros,  there  is  a  non-zero  probability  of  every  possible  transition  in  n 
steps. Show that the chain represented by the transition matrix

h        0 
0        A 


is  ergodic  but  not  regular. 

6. Suppose that the following is an extract from a stationary process, the values
being taken at successive times separated by the fixed interval T: -5, -6, -2, 4, 7,
3, 1, -5, -1, 2. Estimate the auto-correlation coefficient by calculating the sample
variance of the observations and the sample covariance between successive pairs.
Hint: Take the variance as $[V(X_r)V(X_{r-1})]^{1/2}$. To estimate $V(X_r)$ use the last nine
observations, and to estimate $V(X_{r-1})$ use the first nine. There are nine pairs in the
expression for the covariance.

Appendix A
MATHEMATICAL APPENDIX

A.1 The Limit of $(1 + x/n)^n$ for Fixed x, as $n \to \infty$

By the binomial theorem,

(A.1.1)  $(1 + 1/n)^n = 1 + n\left(\frac{1}{n}\right) + \frac{n(n-1)}{2!}\left(\frac{1}{n}\right)^2 + \cdots + \frac{n(n-1)\cdots 1}{n!}\left(\frac{1}{n}\right)^n$

The (p + 1)th term in this expansion (the term involving 1/p!) is

(A.1.2)  $\frac{1}{p!}\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{p-1}{n}\right)$

This is always positive, and increases for fixed p as n increases (since the subtracted
terms get smaller). Also the number of such terms in $(1 + 1/n)^n$ increases
with n. For both reasons $(1 + 1/n)^n$ increases with n, so that it must
either have a limit or tend to $+\infty$. But we can show that it has a limit, as
follows:

All the factors $1 - 1/n$, $1 - 2/n$, ..., $1 - (p-1)/n$ are less than 1, so that

$(1 + 1/n)^n < 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \cdots + \frac{1}{n!}$
$\qquad < 1 + 1 + \frac{1}{2} + \frac{1}{2^2} + \cdots + \frac{1}{2^{n-1}}$
$\qquad < 1 + \left(1 + \frac{1}{2} + \frac{1}{4} + \cdots\right)$

But the sum of this geometric progression (in parentheses) to infinity is
$1/(1 - \tfrac12) = 2$, so that $(1 + 1/n)^n < 3$, however large n may be. Also, the sum
is obviously greater than 2 (which is the sum of the first two terms alone in Eq.
(1)). The limit is therefore a number between 2 and 3 which may be denoted by e.
It is actually an irrational number, 2.71828 ....
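
The monotone growth and the bounds used in this argument can be observed numerically. The following is a small sketch with illustrative function names; it simply tabulates $(1 + 1/n)^n$ against the partial sums $1 + 1 + 1/2! + \cdots + 1/n!$ that bound it.

```python
def compound(n):
    """(1 + 1/n)^n."""
    return (1.0 + 1.0 / n) ** n

def partial_exp_sum(n):
    """1 + 1 + 1/2! + ... + 1/n!, built term by term to avoid huge factorials."""
    total, term = 1.0, 1.0
    for k in range(1, n + 1):
        term /= k
        total += term
    return total

ns = (1, 2, 5, 10, 100, 1000)
values = [compound(n) for n in ns]
bounds = [partial_exp_sum(n) for n in ns]

for n, v, b in zip(ns, values, bounds):
    print(n, round(v, 6), round(b, 6))
```

Each value in the first column stays below the corresponding partial sum, and both stay below 3.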

The expression (A.1.2), with p fixed, tends to the value 1/p! as $n \to \infty$. The
number e is therefore the sum $1 + 1 + 1/2! + 1/3! + \cdots$, and this sum may
be calculated to any required degree of accuracy by taking enough terms. On an
electronic digital computer it has been obtained to about 60,000 decimal places.
If we define log x by the equation

(A.1.3)  $\log x = \int_1^x \frac{dt}{t}$,  x > 0

and define the exponential function exp x as the inverse of the logarithmic
function, so that

(A.1.4)  x = exp y  if  y = log x,

then the following argument indicates that exp x is the same as the above
number e raised to the power x, written $e^x$.

By Eq. (3), $\log(1 + xt) = \int_1^{1+xt} \frac{du}{u}$, so that, on differentiating the integral
with respect to t (see § A.9),

$\frac{d}{dt}\log(1 + xt) = \frac{1}{1 + xt}\,\frac{d}{dt}(1 + xt) = \frac{x}{1 + xt}$

for t > 0.

But by the definition of the derivative this relation merely states that

$\lim_{h \to 0} \frac{\log(1 + xt + xh) - \log(1 + xt)}{h} = \frac{x}{1 + xt}$

and if we now let $t \to 0$ from above,

$\lim_{h \to 0} h^{-1} \log(1 + xh) = x$

On writing h = 1/k, this becomes

$\lim_{k \to \infty} k \log(1 + x/k) = x$

that is,

$\lim_{k \to \infty} \log(1 + x/k)^k = x$

Hence, by Eq. (4),

(A.1.5)  $\lim_{k \to \infty} (1 + x/k)^k = \exp x$

If we suppose that $k \to \infty$ through integral values only, we have the result

(A.1.6)  $\lim_{n \to \infty} (1 + x/n)^n = \exp x$

It follows from Eq. (6) and the earlier part of this section that exp 1 = e, so
that log e = 1. The quantity $e^x$ for any real x is defined by $\log e^x = x \log e = x$,
in accordance with the usual convention for indices, so that $e^x = \exp x$ and we
obtain the required result, namely,

(A.1.7)  $\lim_{n \to \infty} (1 + x/n)^n = e^x$
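
As a numerical illustration of the limit (A.1.7), the sketch below compares $(1 + x/n)^n$ with the library exponential for a few arbitrarily chosen values of x; the function name is illustrative.

```python
import math

def compound_limit(x, n):
    """(1 + x/n)^n, the quantity whose limit is e^x."""
    return (1.0 + x / n) ** n

for x in (-1.0, 0.5, 2.0):
    approximations = [compound_limit(x, n) for n in (10, 1000, 100000)]
    print(x, approximations, math.exp(x))
```

The convergence is slow: the error falls off roughly in proportion to 1/n, so this is a check on the limit rather than a practical way of computing $e^x$.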


A.2 Stirling's Approximation to n!

Factorials are not very convenient for mathematical manipulation, and it is
often useful to replace n! by an approximation. The most common approximation
is Stirling's, namely,

(A.2.1)  $n! \approx (2\pi)^{1/2} n^{n+1/2} e^{-n}$

or, equivalently,

(A.2.2)  $\log n! \approx \tfrac12 \log(2\pi) + (n + \tfrac12)\log n - n$

The meaning of the approximate equality here is that the ratio of the two sides
tends to 1 as $n \to \infty$. The accuracy of the approximation may be gauged by the
following examples:

n = 5,   5! = 120,         $(2\pi)^{1/2}\,5^{5.5}\,e^{-5} = 118.02$

n = 10,  10! = 3,628,800,  $(2\pi)^{1/2}\,10^{10.5}\,e^{-10} = 3,598,700$

The relative error is roughly 1/(12n), and therefore diminishes as n increases.
We will establish Stirling's result in the following form:

(A.2.3)  $\log n! = (n + \tfrac12)\log n - n + C_n$

where $\tfrac34 + (4n)^{-1} < C_n < 1$, and then show that $\lim_{n \to \infty} C_n = \tfrac12 \log(2\pi)$.
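
The two numerical examples and the 1/(12n) rule for the relative error can be reproduced directly. This is a minimal sketch; the function names are illustrative.

```python
import math

def stirling(n):
    """Stirling's approximation (A.2.1) to n!."""
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

def relative_error(n):
    """(n! - approximation) / n!."""
    exact = math.factorial(n)
    return (exact - stirling(n)) / exact

for n in (5, 10, 20):
    print(n, math.factorial(n), round(stirling(n), 2),
          relative_error(n), 1 / (12 * n))
```

The last two printed columns should nearly coincide, and more closely so as n grows.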

Consider the curve y = log x between x = 1 and x = n (Figure 55). The
area under the curve is given by

(A.2.4)  $A = \int_1^n \log x\,dx = n \log n - n + 1$

Fig. 55  Stirling approximation to n!

If the tops of the ordinates at x = 1, 2, 3, ... are joined by chords, the area
under the chords will be less than that under the curve, since the curve is everywhere
concave to the x-axis. This area is a sum of trapeziums, the area of the
trapezium between k and k + 1 being $\tfrac12[\log k + \log(k + 1)]$. The total area
under the chords is

$\tfrac12[(\log 1 + \log 2) + (\log 2 + \log 3) + \cdots + (\log(n - 1) + \log n)]$

$= \log 1 + \log 2 + \log 3 + \cdots + \log n - \tfrac12(\log 1 + \log n)$

$= \log n! - \tfrac12 \log n.$

Since this is less than A, given by Eq. (4), we have the inequality

(A.2.5)  $\log n! < (n + \tfrac12)\log n - n + 1$

This establishes the upper bound on $C_n$ in Eq. (3).

If we draw tangents to the curve at P(x = k) and Q(x = k + 1) and if the
tangent at P meets the ordinate at Q in B, the slope of PB is 1/k (since
$\frac{d}{dx}\log x = 1/x$) and therefore NB = MP + 1/k = log k + 1/k. Similarly
MA = log(k + 1) - 1/(k + 1). The areas MAQN and MPBN are both greater than the
area under the curve PQ between MP and NQ, and the mean of the two will
also be greater. Since $MAQN = \tfrac12[2\log(k + 1) - 1/(k + 1)]$ and
$MPBN = \tfrac12[2\log k + 1/k]$, the mean of these two is
$\tfrac12[\log k + \log(k + 1)] + \tfrac14[1/k - 1/(k + 1)]$. Summing for all k from 1 to n - 1, we see that the area under
the curve is less than $\log n! - \tfrac12 \log n + \tfrac14(1 - 1/n)$. We have therefore from
Eq. (4) the inequality

(A.2.6)  $\log n! > (n + \tfrac12)\log n - n + \tfrac34 + \frac{1}{4n}$

which establishes the lower bound on $C_n$. Clearly $C_n$ lies between 3/4 and
1. Also as n increases $C_n$ decreases (since the difference between the area under
the curve and that under the set of chords is $1 - C_n$ and this difference increases
with n). Hence $C_n$ must approach a limit C as $n \to \infty$. By using Wallis's
formula (see Appendix A.8), namely,

(A.2.7)  $\lim_{n \to \infty} \frac{2^{2n}(n!)^2}{(2n)!\,n^{1/2}} = \pi^{1/2}$

we can evaluate C as $\tfrac12 \log(2\pi) = 0.9189$. For, by Eq. (3),

$n! = n^{n+1/2} e^{-n} e^{C_n}$,  $(2n)! = (2n)^{2n+1/2} e^{-2n} e^{C_{2n}}$

so that Eq. (7) becomes

$\lim_{n \to \infty} \frac{2^{2n}\,n^{2n+1}\,e^{-2n}\,e^{2C_n}}{(2n)^{2n+1/2}\,e^{-2n}\,e^{C_{2n}}\,n^{1/2}} = \pi^{1/2}$

Since $\lim C_n = \lim C_{2n} = C$ this becomes

$e^C = (2\pi)^{1/2}$

or

$C = \tfrac12 \log(2\pi)$
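
The behaviour of $C_n$ described here (decreasing in n, bounded as in Eq. (3), and tending to $\tfrac12 \log(2\pi)$) can be checked numerically. The sketch below assumes only the definition in Eq. (A.2.3) and uses the standard-library `math.lgamma` to obtain log n!; the name `c_n` is illustrative.

```python
import math

def c_n(n):
    """C_n defined by log n! = (n + 1/2) log n - n + C_n (Eq. A.2.3)."""
    return math.lgamma(n + 1) - (n + 0.5) * math.log(n) + n

limit = 0.5 * math.log(2 * math.pi)      # the constant C
ns = (1, 2, 5, 10, 100, 1000)
cs = [c_n(n) for n in ns]

for n, c in zip(ns, cs):
    print(n, round(c, 6), round(0.75 + 1 / (4 * n), 6))
print("limit", round(limit, 6))
```

For n = 1 the value is exactly 1, and the printed values then fall steadily towards 0.9189.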

A.3 Improper Integrals

If f(x) is a continuous function of x for x > a and if

(A.3.1)  $\lim_{b \to \infty} \int_a^b f(x)\,dx \ (= l)$

exists, then the improper integral $\int_a^\infty f(x)\,dx$ is said to converge and has the
value l. If it does not converge, the integral either diverges to $+\infty$ or $-\infty$, or
oscillates. Similarly, if f(x) is continuous for x < b and if

(A.3.2)  $\lim_{a \to -\infty} \int_a^b f(x)\,dx \ (= k)$

exists, the integral $\int_{-\infty}^b f(x)\,dx$ converges and is equal to k.

If both integrals $\int_{-\infty}^a$ and $\int_a^\infty$ converge and have the values k and l respectively,
then $\int_{-\infty}^\infty f(x)\,dx$ converges and has the value k + l.

The Cauchy principal value of the integral $\int_{-\infty}^\infty f(x)\,dx$ is given by

(A.3.3)  $\lim_{c \to \infty} \int_{-c}^{c} f(x)\,dx$

This limit may exist even though the two separate integrals do not. Thus
$\int_{-c}^{c} \frac{x\,dx}{x^2 + 1} = 0$ for all real values of c, but both integrals
$\int_{-\infty}^{a} \frac{x\,dx}{x^2 + 1}$ and $\int_{a}^{\infty} \frac{x\,dx}{x^2 + 1}$
diverge, since the indefinite integral of f(x) is here $\tfrac12 \log(x^2 + 1)$.
The improper integral $\int_{-\infty}^{\infty} \frac{x\,dx}{x^2 + 1}$ therefore diverges, but its Cauchy principal
value is zero.
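
The contrast between the symmetric and the one-sided integrals of $x/(x^2 + 1)$ can be seen numerically. The following sketch uses a simple midpoint rule (an illustrative choice, not a method from the text) and compares the one-sided integrals from 0 to a with $\tfrac12 \log(a^2 + 1)$.

```python
import math

def integrand(x):
    return x / (x * x + 1.0)

def midpoint(f, a, b, steps=100_000):
    """Midpoint-rule approximation to the integral of f from a to b."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

limits = (1.0, 10.0, 100.0)
symmetric = [midpoint(integrand, -c, c) for c in limits]     # stays at zero
one_sided = [midpoint(integrand, 0.0, a) for a in limits]    # grows without bound
exact = [0.5 * math.log(a * a + 1.0) for a in limits]

print(symmetric)
print(one_sided)
print(exact)
```

The symmetric integrals vanish for every c, while the one-sided integrals keep growing like log a.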

Another  type  of  improper  integral  is  that  in  which  f(x)  becomes  infinite 
at  some  point  or  points  of  the  range  of  integration ;  by  splitting  up  the  range 
into  sub-intervals  marked  off  by  these  points  we  need  consider  only  the  cases 
when  f(x)  becomes  infinite  at  either  the  lower  bound  or  the  upper  bound  of 
integration. 

If $f(x) \to \infty$ as $x \to a$ from above, but is otherwise continuous in the
interval from a to b, and if

(A.3.4)  $\lim_{\epsilon \to 0} \int_{a+\epsilon}^{b} f(x)\,dx = l$

then the integral $\int_a^b f(x)\,dx$ converges and has the value l. Similarly, if $f(x) \to \infty$
as $x \to b$ from below, but is otherwise continuous from a to b, and if

(A.3.5)  $\lim_{\epsilon \to 0} \int_a^{b-\epsilon} f(x)\,dx = k$

then $\int_a^b f(x)\,dx$ converges and has the value k.

If f(x) becomes infinite at a value x = c between x = a and x = b, we can
define the Cauchy principal value of the integral, if it exists, by

(A.3.6)  $\int_a^b f(x)\,dx = \lim_{\epsilon \to +0} \left( \int_a^{c-\epsilon} f(x)\,dx + \int_{c+\epsilon}^{b} f(x)\,dx \right)$

A.4 Change of Variables in Integration

It is often convenient, in order to perform an integration, to change the
variables from one set to another. With a single variable the process is probably
familiar to students who have had a course of calculus. If the variable is changed
from x to u, where x = g(u), if g(u) is a differentiable function of u, and if f(x) is
integrable from a to b, then

(A.4.1)  $\int_a^b f(x)\,dx = \int_\alpha^\beta f[g(u)]\,g'(u)\,du$

where $g'(u) = dg(u)/du$, $a = g(\alpha)$ and $b = g(\beta)$. Thus,

$\int_0^{1/2} \frac{dx}{\sqrt{1 - x^2}} = \int_0^{\pi/6} \frac{\cos u\,du}{\sqrt{1 - \sin^2 u}} = \int_0^{\pi/6} du = \frac{\pi}{6}$

where the change of variable is expressed by x = sin u.
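
The worked substitution x = sin u can be verified numerically by integrating both sides of Eq. (A.4.1) separately. The sketch below uses a midpoint rule; the helper name is illustrative.

```python
import math

def midpoint(f, a, b, steps=100_000):
    """Midpoint-rule approximation to the integral of f from a to b."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# Left side: integral of f(x) = 1 / sqrt(1 - x^2) from 0 to 1/2.
lhs = midpoint(lambda x: 1.0 / math.sqrt(1.0 - x * x), 0.0, 0.5)

# Right side: f[g(u)] g'(u) with g(u) = sin u, from 0 to pi/6.
rhs = midpoint(lambda u: math.cos(u) / math.sqrt(1.0 - math.sin(u) ** 2),
               0.0, math.pi / 6)

print(lhs, rhs, math.pi / 6)
```

Both numerical values agree with π/6 to many decimal places, the new bounds 0 and π/6 being the values of u at which sin u equals 0 and 1/2.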

Care is necessary in determining the new bounds of integration $\alpha$ and $\beta$ to
make sure that the interval from $\alpha$ to $\beta$ for u does correspond to the interval
from a to b for x, particularly when x is not a single-valued function of u. Thus,
for example, if $u = x^2$, either interval of x, from -2 to -1 or from 1 to 2,
would correspond to the same interval of u from 1 to 4. In the first case, g'(u) is
negative, and in the second, positive, but the bounds of integration for u are
interchanged in the two cases.

If there are two variables x and y, and the integral is a double one, we may
need to change to a new pair of variables u and v, where x = g(u, v) and y =
h(u, v). A double integral over a region R of the (x, y) plane is calculated by
means of a repeated integral

(A.4.2)  $\int_{(R)} f(x, y)\,dA = \int_a^b \int_{g_1(x)}^{g_2(x)} f(x, y)\,dy\,dx$

The region of integration is considered as bounded by the curves $y = g_1(x)$ and
$y = g_2(x)$ between x = a and x = b. The corresponding region $R_1$ in the (u, v)
plane is bounded by the curves $v = \gamma_1(u)$, $v = \gamma_2(u)$, between $u = \alpha$ and $u = \beta$
(Figure 56). The element of area dy dx in the (x, y) plane becomes in the (u, v)
plane

(A.4.3)  $dA = \left| J\!\left(\frac{x, y}{u, v}\right) \right| du\,dv$

where

(A.4.4)  $J\!\left(\frac{x, y}{u, v}\right) = \begin{vmatrix} \dfrac{\partial g}{\partial u} & \dfrac{\partial g}{\partial v} \\ \dfrac{\partial h}{\partial u} & \dfrac{\partial h}{\partial v} \end{vmatrix} = \frac{\partial g}{\partial u}\frac{\partial h}{\partial v} - \frac{\partial g}{\partial v}\frac{\partial h}{\partial u}$

Fig. 56  Change of variables in integration


The functions g and h are supposed to possess continuous first partial derivatives
throughout the region $R_1$, and it is also supposed that $J\!\left(\frac{x, y}{u, v}\right)$ does not vanish
anywhere in $R_1$. This function J is called the Jacobian of g and h with respect
to u and v. The double integral (A.4.2) can then be written as the repeated
integral

(A.4.5)  $\int_{(R)} f(x, y)\,dA = \int_\alpha^\beta \int_{\gamma_1(u)}^{\gamma_2(u)} f(g, h)\,|J|\,dv\,du$

An example is the change from Cartesian coordinates x, y to polar coordinates
r, $\theta$, where $x = g(r, \theta) = r\cos\theta$, $y = h(r, \theta) = r\sin\theta$. Here

$J\!\left(\frac{x, y}{r, \theta}\right) = \frac{\partial g}{\partial r}\frac{\partial h}{\partial \theta} - \frac{\partial g}{\partial \theta}\frac{\partial h}{\partial r} = \cos\theta\,(r\cos\theta) - (-r\sin\theta)\sin\theta = r$

If f(x, y) is integrable over the whole first quadrant in the (x, y) plane, we
have

(A.4.6)  $\int_0^\infty \int_0^\infty f(x, y)\,dy\,dx = \int_0^\infty \int_0^{\pi/2} f(r\cos\theta, r\sin\theta)\,r\,d\theta\,dr$

since in the first quadrant $\theta$ ranges from 0 to $\pi/2$ and r from 0 to $\infty$.
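
The value J = r for the polar transformation can be confirmed by estimating the four partial derivatives numerically. The sketch below uses central differences with an arbitrarily chosen step; the test point (r, θ) = (2.0, 0.7) is likewise only an illustration.

```python
import math

def g(r, theta):
    """x = g(r, theta) = r cos(theta)."""
    return r * math.cos(theta)

def h(r, theta):
    """y = h(r, theta) = r sin(theta)."""
    return r * math.sin(theta)

def jacobian(r, theta, eps=1e-6):
    """Central-difference estimate of dg/dr * dh/dtheta - dg/dtheta * dh/dr."""
    dg_dr = (g(r + eps, theta) - g(r - eps, theta)) / (2 * eps)
    dg_dt = (g(r, theta + eps) - g(r, theta - eps)) / (2 * eps)
    dh_dr = (h(r + eps, theta) - h(r - eps, theta)) / (2 * eps)
    dh_dt = (h(r, theta + eps) - h(r, theta - eps)) / (2 * eps)
    return dg_dr * dh_dt - dg_dt * dh_dr

print(jacobian(2.0, 0.7))   # close to r = 2
```

The estimate does not depend on θ, in agreement with the algebraic result.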

The equations x = g(u, v), y = h(u, v) can be solved to express u and v in
terms of x and y. If the Jacobian does not vanish, the resulting functions,
$u = \phi(x, y)$ and $v = \psi(x, y)$, will themselves be differentiable. We can then
calculate $J\!\left(\frac{\phi, \psi}{x, y}\right)$. It is sometimes convenient to note that

(A.4.7)  $J\!\left(\frac{\phi, \psi}{x, y}\right) = \left[ J\!\left(\frac{x, y}{u, v}\right) \right]^{-1}$

since one of these Jacobians may be easier to calculate than the other.

The above considerations may be extended to triple or n-dimensional
integrals.

A.5 The Gamma Function

The improper convergent integral

(A.5.1)  $\Gamma(n) = \int_0^\infty x^{n-1} e^{-x}\, dx, \quad n > 0$

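is called the gamma function of n. Using the formula for integration by parts we easily obtain

(A.5.2)  $\Gamma(n+1) = \int_0^\infty x^n e^{-x}\, dx = \left[ -x^n e^{-x} \right]_0^\infty + n\int_0^\infty x^{n-1} e^{-x}\, dx = n\,\Gamma(n)$

since $\lim_{x\to\infty} x^n e^{-x} = 0$ for all n > 0.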


If we put n = 1 in Eq. (1), we obtain $\Gamma(1) = \int_0^\infty e^{-x}\, dx = 1$, so that Γ(2) = 1·Γ(1) = 1, Γ(3) = 2·Γ(2) = 2, and in general for any positive integral value of n,

(A.5.3)  $\Gamma(n+1) = n(n-1)\cdots 1 = n!$

The gamma function may therefore be regarded as a generalized factorial, and indeed the notation n! is often used for the integral denoted here by Γ(n + 1) for any n > −1, whether integral or not.

For n = 0, the integral of Eq. (1) diverges.
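The defining integral (A.5.1) and the recurrence (A.5.2) can be checked numerically. The sketch below is only an illustration: `gamma_integral` is a hypothetical helper that applies a midpoint rule on a truncated range (the cut-off at 60 and the step count are assumptions), and the results are compared with `math.gamma` from the Python standard library.

```python
import math


def gamma_integral(n, upper=60.0, steps=200000):
    # Midpoint-rule approximation to the integral of x^(n-1) e^(-x) over [0, upper].
    # Intended for n >= 1; the tail beyond `upper` is negligible for moderate n.
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** (n - 1) * math.exp(-x)
    return total * h


if __name__ == "__main__":
    # Gamma(3) = 2! and a non-integral argument.
    print(gamma_integral(3), math.gamma(3))
    print(gamma_integral(2.5), math.gamma(2.5))
    # Recurrence Gamma(n + 1) = n Gamma(n) for a non-integral n.
    print(math.gamma(4.5), 3.5 * math.gamma(3.5))
```

The same recurrence, read backwards, is what extends the function to negative non-integral arguments: `math.gamma(-0.5)` agrees with `-2 * math.gamma(0.5)`.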


For negative values of n, except negative integers, Γ(n) may be defined by means of Eq. (2),

$\Gamma(n) = \frac{1}{n}\,\Gamma(n+1)$

Thus

$\Gamma(-\tfrac{1}{2}) = -2\,\Gamma(\tfrac{1}{2})$

The graph of Γ(n) is shown in Figure 57. The function is discontinuous at all negative integral values and at n = 0.


Fig. 57  The gamma function


An alternative form for the gamma function is obtained by a change of variable. If x = u², we have by Eq. (A.4.1)

(A.5.4)  $\Gamma(n) = 2\int_0^\infty u^{2n-1} e^{-u^2}\, du$

A.6  The  Beta  Function 

The definite integral

(A.6.1)  $B(m, n) = \int_0^1 x^{m-1}(1-x)^{n-1}\, dx, \quad m > 0,\ n > 0$

is called the beta function of m and n. Clearly, B(1, 1) = 1. If we write 1 − y for x, we obtain

(A.6.2)  $B(m, n) = -\int_1^0 (1-y)^{m-1} y^{n-1}\, dy = \int_0^1 y^{n-1}(1-y)^{m-1}\, dy = B(n, m)$
so that, in the beta function, m and n may be interchanged at will. Alternative forms for the beta function may be obtained by change of variable. Thus, if x = sin²θ,

(A.6.3)  $B(m, n) = 2\int_0^{\pi/2} \sin^{2m-1}\theta\, \cos^{2n-1}\theta\, d\theta$

and if $x = (1 + y)^{-1}$,

(A.6.4)  $B(m, n) = \int_0^\infty \frac{y^{n-1}}{(1+y)^{m+n}}\, dy$


An important relation between the beta function and the gamma function is the following:

(A.6.5)  $B(m, n) = \frac{\Gamma(m)\,\Gamma(n)}{\Gamma(m+n)}$

To prove this, we note from Eq. (A.5.4) that

$\Gamma(m)\,\Gamma(n) = 4\int_0^\infty x^{2m-1} e^{-x^2}\, dx \int_0^\infty y^{2n-1} e^{-y^2}\, dy$

may be written as

$4\int_0^\infty\!\!\int_0^\infty x^{2m-1} y^{2n-1} e^{-(x^2+y^2)}\, dy\, dx$

This repeated integral may be written as a double integral:

$4\iint x^{2m-1} y^{2n-1} e^{-(x^2+y^2)}\, dA$

and interpreted as an integral over the first quadrant of the (x, y) plane. Changing to polar coordinates (r, θ) and using Eq. (A.4.6), we find

$\Gamma(m)\,\Gamma(n) = 4\int_0^\infty\!\!\int_0^{\pi/2} (r\cos\theta)^{2m-1} (r\sin\theta)^{2n-1} e^{-r^2}\, r\, d\theta\, dr$

$\qquad = 4\int_0^\infty r^{2m+2n-1} e^{-r^2}\, dr \int_0^{\pi/2} \cos^{2m-1}\theta\, \sin^{2n-1}\theta\, d\theta$

where the double integral is now written as a repeated integral. Using Eqs. (A.5.4) and (A.6.3), we obtain

$\Gamma(m)\cdot\Gamma(n) = \Gamma(m+n)\cdot B(n, m)$

whence, by Eq. (2),

$B(m, n) = \frac{\Gamma(m)\,\Gamma(n)}{\Gamma(m+n)}$
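The relation between the beta and gamma functions lends itself to a quick numerical comparison. In the sketch below, `beta_integral` (a hypothetical helper, using a midpoint rule on the defining integral) is set against the gamma-function ratio computed with `math.gamma`; the step count and the sample arguments are assumptions chosen for illustration.

```python
import math


def beta_integral(m, n, steps=200000):
    # Midpoint rule for the integral of x^(m-1) (1-x)^(n-1) over [0, 1].
    # Intended for m, n >= 1 so that the integrand stays bounded.
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** (m - 1) * (1.0 - x) ** (n - 1)
    return total * h


def beta_from_gamma(m, n):
    # Gamma(m) Gamma(n) / Gamma(m + n)
    return math.gamma(m) * math.gamma(n) / math.gamma(m + n)


if __name__ == "__main__":
    for m, n in [(2, 3), (2.5, 1.5)]:
        print(m, n, beta_integral(m, n), beta_from_gamma(m, n))
```

For integral arguments the ratio reduces to factorials; B(2, 3) = 1!·2!/4! = 1/12.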

A.7 The Integral $\int_{-\infty}^{\infty} e^{-u^2}\, du$ and Related Integrals

From Eq. (A.6.3),

(A.7.1)  $B(\tfrac12, \tfrac12) = 2\int_0^{\pi/2} d\theta = \pi$

But, by Eq. (A.6.5),

$B(\tfrac12, \tfrac12) = \frac{[\Gamma(\tfrac12)]^2}{\Gamma(1)} = [\Gamma(\tfrac12)]^2$

so that

(A.7.2)  $\Gamma(\tfrac12) = \pi^{1/2}$

Therefore, from Eq. (A.5.4), with n = ½,

$2\int_0^\infty e^{-u^2}\, du = \pi^{1/2}$

or, since $e^{-u^2}$ is an even function of u,

(A.7.3)  $\int_{-\infty}^{\infty} e^{-u^2}\, du = \pi^{1/2}$

Writing $\sqrt{2}\,u = v$, we obtain

(A.7.4)  $\int_{-\infty}^{\infty} e^{-v^2/2}\, dv = (2\pi)^{1/2}$

or

(A.7.5)  $\int_{-\infty}^{\infty} \phi(v)\, dv = 1, \qquad \phi(v) = (2\pi)^{-1/2} e^{-v^2/2}$

This expresses the fact that the total area under the standard normal curve is 1, as it must be if φ(v) is to be a probability density function.
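A useful related integral is

(A.7.6)  $I_k = \int_0^\infty v^k e^{-v^2/2}\, dv, \qquad k = 1, 2, 3, \ldots$

Putting $v = \sqrt{2}\,u$, we get

(A.7.7)  $I_k = \int_0^\infty 2^{k/2} u^k e^{-u^2}\, 2^{1/2}\, du$

$\qquad = 2^{(k+1)/2}\cdot\tfrac12\,\Gamma\!\left(\frac{k+1}{2}\right)$  by Eq. (A.5.4)

$\qquad = 2^{(k-1)/2}\,\Gamma\!\left(\frac{k+1}{2}\right)$

Thus,

(A.7.8)
$I_1 = \Gamma(1) = 1$,
$I_2 = 2^{1/2}\,\Gamma(\tfrac32) = 2^{1/2}\cdot\tfrac12\,\Gamma(\tfrac12) = (\pi/2)^{1/2}$,
$I_3 = 2\,\Gamma(2) = 2$,
$I_4 = 2^{3/2}\,\Gamma(\tfrac52) = 2^{3/2}\cdot\tfrac32\cdot\tfrac12\cdot\Gamma(\tfrac12) = 3(\pi/2)^{1/2}$

and so on.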
A.8 Wallis's Formula

By Eq. (A.1.5), $\lim_{n\to\infty}\left(1 - \frac{t}{n}\right)^n = e^{-t}$, so that, for x > 0,

(A.8.1)  $\lim_{n\to\infty} \int_0^n \left(1 - \frac{t}{n}\right)^n t^{x-1}\, dt = \int_0^\infty e^{-t} t^{x-1}\, dt = \Gamma(x)$

Putting t/n = u, we get

$\lim_{n\to\infty} \int_0^1 (1-u)^n (un)^{x-1}\, n\, du = \Gamma(x)$

and therefore,

(A.8.2)  $\lim_{n\to\infty} n^x \int_0^1 u^{x-1}(1-u)^n\, du = \Gamma(x)$

The integral in Eq. (2) is B(x, n + 1) = Γ(x)Γ(n + 1)/Γ(n + x + 1), so that

(A.8.3)  $\lim_{n\to\infty} \frac{n^x\,\Gamma(n+1)}{\Gamma(n+x+1)} = 1$

Putting x = ½, and noting that

$\Gamma(n+1) = n!$

and

$\Gamma(n+\tfrac32) = (n+\tfrac12)(n-\tfrac12)(n-\tfrac32)\cdots(\tfrac12)\,\Gamma(\tfrac12)$

$\qquad = \frac{2n+1}{2}\cdot\frac{2n-1}{2}\cdot\frac{2n-3}{2}\cdots\frac12\,\pi^{1/2}$

$\qquad = \frac{(2n+1)!\,\pi^{1/2}}{2n(2n-2)\cdots(2)\,2^{n+1}} = \frac{(2n+1)!\,\pi^{1/2}}{n!\,2^{2n+1}}$

we obtain from Eq. (3)

(A.8.4)  $\lim_{n\to\infty} \frac{n^{1/2}(n!)^2\,2^{2n+1}}{(2n+1)!\,\pi^{1/2}} = 1$

or

(A.8.5)  $\lim_{n\to\infty} \frac{2^{2n}(n!)^2}{(2n)!\,n^{1/2}} = \lim_{n\to\infty} \frac{2n+1}{2n}\,\pi^{1/2} = \pi^{1/2}$

This is Wallis's formula.
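The limit in Wallis's formula is approached rather slowly, which a short computation makes visible. The sketch below evaluates the ratio 2^(2n) (n!)² / ((2n)! n^(1/2)) through `math.lgamma` so that large factorials do not overflow; the function name and the sample values of n are assumptions for illustration.

```python
import math


def wallis_ratio(n):
    # log of 2^(2n) (n!)^2 / ((2n)! sqrt(n)), using lgamma(k + 1) = log(k!)
    log_ratio = (
        2 * n * math.log(2)
        + 2 * math.lgamma(n + 1)
        - math.lgamma(2 * n + 1)
        - 0.5 * math.log(n)
    )
    return math.exp(log_ratio)


if __name__ == "__main__":
    for n in (1, 10, 100, 10_000, 1_000_000):
        print(n, wallis_ratio(n))
    print("sqrt(pi) =", math.sqrt(math.pi))
```

The printed values decrease towards π^(1/2) as n grows.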


A.9  Differentiation  Under  the  Sign  of  Integration 

Let f(x, θ) be a function of x depending on a parameter θ and continuous over an interval of x between a and b, where a and b are themselves differentiable functions of θ. Suppose also that the partial derivative ∂f/∂θ exists and is continuous over the same interval of x, for all admissible values of θ. Then if

(A.9.1)  $I(\theta) = \int_{a(\theta)}^{b(\theta)} f(x, \theta)\, dx$

the derivative of I(θ) with respect to θ is given by

(A.9.2)  $\frac{dI(\theta)}{d\theta} = \int_a^b \frac{\partial f}{\partial\theta}\, dx + f(b, \theta)\frac{db}{d\theta} - f(a, \theta)\frac{da}{d\theta}$

This is known as Leibniz's formula. A proof may be found in textbooks of advanced calculus, such as Sokolnikoff's (McGraw-Hill, 1939), page 121. If a and b are independent of θ, the last two terms vanish and Eq. (2) becomes

(A.9.3)  $\frac{dI(\theta)}{d\theta} = \int_a^b \frac{\partial f}{\partial\theta}\, dx$


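Example  Let $f(x, \theta) = e^{-x^2/(2\theta)}$, and let $a = -(2\theta)^{1/2}$, $b = (2\theta)^{1/2}$. Then $da/d\theta = -(2\theta)^{-1/2}$, $db/d\theta = (2\theta)^{-1/2}$, and $\partial f/\partial\theta = (x^2/2\theta^2)\,e^{-x^2/(2\theta)}$. Also $f(a, \theta) = f(b, \theta) = e^{-1}$, so that if

$I(\theta) = \int_{-(2\theta)^{1/2}}^{(2\theta)^{1/2}} e^{-x^2/(2\theta)}\, dx$

we have

$\frac{dI}{d\theta} = \frac{1}{2\theta^2}\int_{-(2\theta)^{1/2}}^{(2\theta)^{1/2}} x^2 e^{-x^2/(2\theta)}\, dx + 2e^{-1}(2\theta)^{-1/2}$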
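The example can be verified numerically by comparing the right-hand side of Leibniz's formula with a finite-difference derivative of I(θ). The sketch below is an illustration only: the midpoint-rule helper `integrate`, the step sizes and the test value θ = 2 are assumptions, not part of the text.

```python
import math


def integrate(g, a, b, steps=20000):
    # Simple midpoint rule for the integral of g over [a, b].
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h


def I(theta):
    # I(theta) with limits -(2 theta)^(1/2) and (2 theta)^(1/2).
    b = math.sqrt(2 * theta)
    return integrate(lambda x: math.exp(-x * x / (2 * theta)), -b, b)


def leibniz_rhs(theta):
    # Integral of df/dtheta plus the two boundary terms, which combine
    # to 2 e^(-1) (2 theta)^(-1/2) in this example.
    b = math.sqrt(2 * theta)
    integral = integrate(
        lambda x: x * x * math.exp(-x * x / (2 * theta)), -b, b
    )
    return integral / (2 * theta ** 2) + 2 * math.exp(-1) / b


def numerical_derivative(theta, eps=1e-4):
    # Central difference approximation to dI/dtheta.
    return (I(theta + eps) - I(theta - eps)) / (2 * eps)


if __name__ == "__main__":
    print(leibniz_rhs(2.0), numerical_derivative(2.0))
```

Dropping the boundary term `2 * math.exp(-1) / b` makes the two printed numbers disagree, which shows that the variable limits contribute to the derivative.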


A.10 Orthogonal Linear Transformations

The linear transformation

(A.10.1)  $Y_i = \sum_{j=1}^{n} c_{ij} X_j, \qquad i = 1, 2, \ldots, n$

is called orthogonal if the constants $c_{ij}$ satisfy the conditions

(A.10.2)  $\sum_k c_{ik} c_{jk} = \begin{cases} 1, & \text{when } i = j \\ 0, & \text{when } i \ne j \end{cases}$

If the determinant of the coefficients $c_{ij}$ is multiplied by itself, with rows and columns transposed (transposing does not alter its value), the result, using Eq. (2), is

$|c_{ij}|^2 = \begin{vmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{vmatrix} = 1$

The value of the determinant is therefore ±1, and it can be taken as 1 by changing, if necessary, the sign of one of the $Y_i$. Since

(A.10.3)  $\frac{\partial Y_i}{\partial X_j} = c_{ij}$

this determinant is also the Jacobian of the Y's with respect to the X's (see Appendix A.4). Therefore,

(A.10.4)  $dY_1 \cdots dY_n = dX_1 \cdots dX_n$

It can be proved by using some matrix algebra (see § A.22) that Eq. (2) implies

(A.10.5)  $\sum_k c_{ki} c_{kj} = \begin{cases} 1, & \text{when } i = j \\ 0, & \text{when } i \ne j \end{cases}$

From Eq. (1), by squaring,

(A.10.6)  $Y_k^2 = \sum_j c_{kj}^2 X_j^2 + \sum_{i \ne j} c_{ki} c_{kj} X_i X_j$

and therefore, by Eq. (5),

(A.10.7)  $\sum_k Y_k^2 = \sum_j X_j^2$

The orthogonal transformation with determinant 1 is equivalent geometrically to a rotation of the coordinate axes about the origin. Such a rotation, of course, leaves the distance of any point from the origin unchanged, and this is the meaning of Eq. (7).
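A plane rotation gives the simplest concrete instance of these properties. The sketch below builds a 2 by 2 rotation matrix, checks the orthogonality conditions and the unit determinant, and confirms that the sum of squares is unchanged; the angle, the sample point and the helper names are assumptions made for illustration.

```python
import math


def rotation(angle):
    # Coefficients c_ij of a rotation of the plane through `angle` radians.
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def transform(C, X):
    # Y_i = sum_j c_ij X_j
    return [sum(C[i][j] * X[j] for j in range(len(X))) for i in range(len(C))]


def is_orthogonal(C, tol=1e-12):
    # Checks sum_k c_ik c_jk = 1 when i = j and 0 otherwise.
    n = len(C)
    for i in range(n):
        for j in range(n):
            dot = sum(C[i][k] * C[j][k] for k in range(n))
            target = 1.0 if i == j else 0.0
            if abs(dot - target) > tol:
                return False
    return True


def determinant2(C):
    # Determinant of a 2 by 2 matrix.
    return C[0][0] * C[1][1] - C[0][1] * C[1][0]


if __name__ == "__main__":
    C = rotation(0.7)
    X = [3.0, 4.0]
    Y = transform(C, X)
    print(is_orthogonal(C), determinant2(C))
    print(sum(x * x for x in X), sum(y * y for y in Y))
```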

A.11 Angle Brackets and k-Statistics

As in § 5.8 we define

$N\langle p \rangle = \sum_{i=1}^{N} X_i^p = S_p$

$N(N-1)\langle pq \rangle = {\sum_{ij}}' X_i^p X_j^q$

$N(N-1)(N-2)\langle pqr \rangle = {\sum_{ijk}}' X_i^p X_j^q X_k^r$

and so on, where the symbol Σ' indicates that in every term of the sum the subscripts i, j ... are to be different.

Then

$S_1^2 = \left(\sum_i X_i\right)\left(\sum_j X_j\right) = \sum_i X_i^2 + {\sum_{ij}}' X_i X_j$

where we have separated the terms of the product in which i = j from the terms in which i and j are different. Since $\sum_i X_i^2 = S_2$, and ${\sum_{ij}}' X_i X_j = N(N-1)\langle 11 \rangle$, we have

(A.11.1)  $S_1^2 = S_2 + N(N-1)\langle 11 \rangle$

Again,

$S_1 S_2 = \left(\sum_i X_i\right)\left(\sum_j X_j^2\right) = \sum_i X_i^3 + {\sum_{ij}}' X_i X_j^2$

from which we get

(A.11.2)  $S_1 S_2 = S_3 + N(N-1)\langle 12 \rangle$

Other results given in Eq. (5.8.2) may be found similarly. Also,

$N(N-1)\,S_1 \langle 11 \rangle = \left(\sum_i X_i\right)\left({\sum_{jk}}' X_j X_k\right)$

In the terms of this product there will be some in which i = j, some in which i = k, and some in which i, j, k are all different. Therefore,

$N(N-1)\,S_1 \langle 11 \rangle = {\sum}' X_j^2 X_k + {\sum}' X_j X_k^2 + {\sum}' X_i X_j X_k$

$\qquad = 2N(N-1)\langle 12 \rangle + N(N-1)(N-2)\langle 111 \rangle$

Therefore,

(A.11.3)  $(N-2)\langle 111 \rangle = S_1 \langle 11 \rangle - 2\langle 12 \rangle$

Similar arguments will give the other results of Eq. (5.8.3). The checks Eq. (5.8.6) are straightforward algebraic relations derived from the definitions of the k's and the above properties of the brackets. Thus the first one states that

$\langle 11 \rangle = \langle 1 \rangle^2 - N^{-1}(\langle 2 \rangle - \langle 11 \rangle)$

or

$\frac{N-1}{N}\langle 11 \rangle = \langle 1 \rangle^2 - N^{-1}\langle 2 \rangle = N^{-2} S_1^2 - N^{-2} S_2$

which is equivalent to Eq. (A.11.1).
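The bracket identities can be confirmed by brute force on a small data set. In the sketch below, `bracket` is a hypothetical helper that averages products of powers over all ordered selections of distinct subscripts, which is how the angle brackets are read here (an assumption consistent with Σ' X_i X_j = N(N − 1)⟨11⟩); exact fractions are used so that the identities hold without rounding.

```python
from fractions import Fraction
from itertools import permutations


def power_sum(xs, r):
    # S_r = sum of X_i^r
    return sum(x ** r for x in xs)


def bracket(xs, *powers):
    # Mean of X_i^p X_j^q ... over ordered tuples of distinct subscripts.
    tuples = list(permutations(range(len(xs)), len(powers)))
    total = 0
    for idx in tuples:
        term = 1
        for i, p in zip(idx, powers):
            term *= xs[i] ** p
        total += term
    return Fraction(total, len(tuples))


if __name__ == "__main__":
    xs = [1, 2, 4, 7]  # illustrative sample, not from the text
    N = len(xs)
    S1, S2, S3 = (power_sum(xs, r) for r in (1, 2, 3))
    print(S1 ** 2, S2 + N * (N - 1) * bracket(xs, 1, 1))
    print(S1 * S2, S3 + N * (N - 1) * bracket(xs, 1, 2))
    print((N - 2) * bracket(xs, 1, 1, 1),
          S1 * bracket(xs, 1, 1) - 2 * bracket(xs, 1, 2))
```

Each printed pair should agree exactly. The enumeration grows quickly with N and with the number of subscripts, so it is suitable only for checking small cases.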

A.12 Bernoulli Numbers and Sheppard's Corrections

The rth Bernoulli number is defined as the coefficient of $t^r/r!$ in the expansion of $t(e^t - 1)^{-1}$. Therefore,

(A.12.1)  $\sum_{r=0}^{\infty} B_r \frac{t^r}{r!} = t(e^t - 1)^{-1} = \frac{t}{2}\left(\coth\frac{t}{2} - 1\right)$

where $\coth x = (e^x + e^{-x})/(e^x - e^{-x})$.

The first few of these numbers are

$B_0 = 1, \quad B_1 = -\tfrac12, \quad B_2 = \tfrac16, \quad B_4 = -\tfrac1{30}, \quad B_6 = \tfrac1{42}$

All the B's with odd subscript, except $B_1$, are zero.

In the grouping of a distribution into classes with class-interval c, any value X of the variate is recorded as the nearest class-mark $X_i$. The difference between $X_i$ and X is not greater numerically than c/2. That is,

(A.12.2)  $X_i = X + \varepsilon$

where ε may be supposed to have a uniform (rectangular) distribution on the interval −c/2 to c/2. This, of course, is not usually true and is an assumption for reasons of mathematical convenience, but if the grouping is reasonably fine (intervals short compared with the effective range) it is not likely to be very far out. If K(h), $K_i(h)$ and $K_\varepsilon(h)$ are the cumulant generating functions for X, $X_i$ and ε respectively,

(A.12.3)  $K_i(h) = K(h) + K_\varepsilon(h)$

by the main property of such functions, § 2.12.

Now $K_i(h)$ is the c.g.f. for the grouped distribution and will give the uncorrected cumulants, while K(h) is the c.g.f. for the true distribution and will give the true cumulants. Also, by § 2.10, Example 4,

(A.12.4)  $K_\varepsilon(h) = \log M_\varepsilon(h) = \log\sinh\frac{ch}{2} - \log\frac{ch}{2} = \log\sinh(t/2) - \log(t/2)$

where t = ch.

Differentiating with respect to t, we have

(A.12.5)  $\frac{dK_\varepsilon(h)}{dt} = \frac12\coth\!\left(\frac{t}{2}\right) - \frac1t$

But, by Eq. (1),

$\frac{t}{2}\coth\!\left(\frac{t}{2}\right) - 1 = \sum_{r=2}^{\infty} B_r \frac{t^r}{r!}$

so that, on dividing by t,

(A.12.6)  $\frac12\coth\!\left(\frac{t}{2}\right) - \frac1t = \sum_{r=2}^{\infty} B_r \frac{t^{r-1}}{r!}$

From Eqs. (5) and (6), therefore,

$\frac{dK_\varepsilon(h)}{dt} = \sum_{r=2}^{\infty} B_r \frac{t^{r-1}}{r!}$

from which, after integration, we obtain

$K_\varepsilon(h) = C + \sum_{r=2}^{\infty} B_r \frac{t^r}{r\cdot r!}$

Since $K_\varepsilon(h) \to 0$ as t → 0 (t = ch), the constant C is zero, and therefore,

$K_\varepsilon(h) = \sum_{r=2}^{\infty} \frac{B_r c^r}{r}\,\frac{h^r}{r!}$

This is equivalent to

(A.12.7)  $(\kappa_r)_c = \kappa_r - c^r\,\frac{B_r}{r}, \qquad r = 2, 3, \ldots$

In practice, the corrections are usually applied to the sample k-statistics, rather than to the cumulants, since the latter are seldom known except insofar as they are estimated by the former.
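The Bernoulli numbers in the convention used here (with B₁ = −½) and the resulting correction terms c^r B_r / r can be generated exactly. The sketch below uses the standard recurrence that follows from multiplying the generating function by (e^t − 1), namely that the sum of C(m + 1, j) B_j over j = 0, ..., m vanishes for m ≥ 1; the function names are assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def bernoulli(n_max):
    # B_0, ..., B_n_max for the generating function t / (e^t - 1).
    B = [Fraction(1)]
    for m in range(1, n_max + 1):
        s = sum(comb(m + 1, j) * B[j] for j in range(m))
        B.append(-s / (m + 1))
    return B


def sheppard_term(r, c):
    # Quantity c^r B_r / r subtracted from the rth uncorrected cumulant.
    return bernoulli(r)[r] * Fraction(c) ** r / r


if __name__ == "__main__":
    print(bernoulli(6))
    # For r = 2 the term is c^2 / 12, and for r = 4 it is -c^4 / 120.
    print(sheppard_term(2, 1), sheppard_term(4, 1))
```

With r = 2 this reproduces the familiar correction of c²/12 subtracted from the variance of grouped data; for odd r ≥ 3 the term is zero.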

A.13 The Non-Central Chi-Square Distribution

The quantity x with probability density,

(A.13.1)  $f(x) = e^{-x-\lambda}\, x^{k/2-1} \sum_{m=0}^{\infty} \frac{(\lambda x)^m}{m!\,\Gamma(m + k/2)}$

is said to have the non-central chi-square distribution with k degrees of freedom and with parameter of non-centrality λ. When λ = 0 it reduces to the ordinary chi-square distribution, the first term of the series being interpreted as 1/Γ(k/2), and x taken as χ²/2.

If $X_1, X_2, \ldots, X_k$ are independent normal variates, with unit variance and with expectations $\mu_1, \mu_2, \ldots, \mu_k$, and if $H_0$ is the null hypothesis which specifies the values of the $\mu_i$ as $\mu_1^0, \mu_2^0, \ldots, \mu_k^0$, then an unbiased test (see § 8.10) of the hypothesis $H_0$ is provided by the rule: reject $H_0$ if

(A.13.2)  $\sum_i (X_i - \mu_i^0)^2 > \chi_k^2(\alpha)$

where $\chi_k^2(\alpha)$ is the tabular value of central χ² with k degrees of freedom, corresponding to the significance level α, i.e.,

(A.13.3)  $\frac{1}{\Gamma(k/2)} \int_{\frac12\chi_k^2(\alpha)}^{\infty} u^{k/2-1} e^{-u}\, du = \alpha$

If $H_0$ is not true, but instead $\mu_i$ is not equal to $\mu_i^0$ for at least one value of i (hypothesis $H_1$), the quantity on the left-hand side of (2) follows the non-central chi-square distribution with k degrees of freedom and parameter

(A.13.4)  $\lambda = \frac12 \sum_{i=1}^{k} (\mu_i - \mu_i^0)^2$

The power function of this test is

(A.13.5)  $P(\lambda) = \Pr\left\{ \sum_i (X_i - \mu_i^0)^2 > \chi_k^2(\alpha) \,\middle|\, H_1 \right\} = e^{-\lambda} \sum_{m=0}^{\infty} \frac{\lambda^m}{m!\,\Gamma(m + k/2)} \int_{\frac12\chi_k^2(\alpha)}^{\infty} x^{m+\frac12 k-1} e^{-x}\, dx$

In the tables prepared by Miss Evelyn Fix (reference [5] of Chapter 6) the quantity tabulated is the value of λ for certain assigned values of α and P(λ), and for k = 1(1)20(2)40(5)60(10)100. (The λ of the tables is twice the λ defined above.)

A.14 Some Theorems on Conditional Probability

Let X be a random variable which takes the value $x_i$ with probability $p_i(Y)$, subject to the occurrence of the event Y. That is,

(A.14.1)  $p_i(Y) = P(X = x_i \mid Y) = P\{(X = x_i) \cap Y\}/P(Y), \qquad i = 1, 2, \ldots, n$

Then the conditional expectation of X, given Y, is defined as

(A.14.2)  $E(X \mid Y) = \sum_{i=1}^{n} p_i(Y)\, x_i$

This is the same definition as for the ordinary expectation, except that conditional probabilities are used.

If $Y_j$ is the event that the random variable Y takes the value $y_j$,

(A.14.3)  $p_i(Y_j) = P(X = x_i \mid Y = y_j) = \frac{P(X = x_i,\ Y = y_j)}{P(Y = y_j)}$

Theorem 1  The expected value of X is equal to the expectation of the conditional expectation of X, given Y. In symbols,

(A.14.4)  $E(X) = E[E(X \mid Y)] = \sum_j p_j\, E(X \mid Y_j)$

where $p_j$ is the probability that $Y = y_j$. Proof:

$E(X \mid Y_j) = \sum_i x_i\, p_i(Y_j)$

by the definition of conditional expectation. Also, by Eq. (3),

$p_j\, p_i(Y_j) = P(X = x_i,\ Y = y_j)$

Therefore, the right-hand side of Eq. (4) is $\sum_j \sum_i x_i P(X = x_i,\ Y = y_j)$. But the sum over j covers all the possible values of Y, and some value must occur with every X; consequently,

$\sum_j P(X = x_i,\ Y = y_j) = P(X = x_i)$

and the right-hand side of Eq. (4) reduces to

$\sum_i x_i P(X = x_i)$, which is E(X).

The conditional variance of X given Y is defined by

(A.14.5)  $V(X \mid Y) = E\{[X - E(X \mid Y)]^2 \mid Y\}$

with a similar definition for conditional covariance.

Theorem 2  The variance of X can be regarded as made up of two parts, the expectation of the conditional variance and the variance of the conditional expectation. Symbolically,

(A.14.6)  $V(X) = E[V(X \mid Y)] + V[E(X \mid Y)]$

Proof: we may write X − E(X) as X − E(X|Y) + E(X|Y) − E(X), so that

(A.14.7)  $[X - E(X)]^2 = [X - E(X \mid Y)]^2 + 2[X - E(X \mid Y)][E(X \mid Y) - E(X)] + [E(X \mid Y) - E(X)]^2$

The variance of X is the expectation of this expression. By Eq. (4),

$E[X - E(X \mid Y)]^2 = E\{E([X - E(X \mid Y)]^2 \mid Y)\} = E[V(X \mid Y)]$,  by Eq. (5)

Also, since E(X) = E[E(X|Y)], the expectation of the last term of Eq. (7) is V[E(X|Y)]. It only remains to show that the expectation of the middle term of Eq. (7) is zero. This middle term is

(A.14.8)  $2\{X E(X \mid Y) - X E(X) + E(X)\cdot E(X \mid Y) - [E(X \mid Y)]^2\}$

Now E[X E(X|Y)] = E[E(X|Y)·E(X|Y)] by Eq. (4), so that the expectations of the first and last terms of (8) cancel. Similarly,

$E[X E(X)] = E[E(X \mid Y)\cdot E(X)]$

so that the expectations of the two middle terms in (8) cancel. The variance of X is therefore the sum of expectations of the first and last terms of Eq. (7) and this gives Eq. (6).
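Both theorems can be checked exactly on a small discrete joint distribution. The joint probabilities below are invented purely for illustration, and the helper names are assumptions; exact fractions are used so that the two sides of each identity can be compared with `==`.

```python
from fractions import Fraction as F

# Hypothetical joint distribution: keys are (x, y), values are P(X = x, Y = y).
joint = {(0, 0): F(1, 8), (1, 0): F(3, 8), (0, 1): F(1, 4), (2, 1): F(1, 4)}


def marginal_y(joint):
    # p_j = P(Y = y_j)
    out = {}
    for (x, y), p in joint.items():
        out[y] = out.get(y, 0) + p
    return out


def expect(joint, g=lambda x: x):
    # E[g(X)]
    return sum(g(x) * p for (x, _), p in joint.items())


def cond_expect(joint, y, g=lambda x: x):
    # E[g(X) | Y = y]
    py = marginal_y(joint)[y]
    return sum(g(x) * p for (x, yy), p in joint.items() if yy == y) / py


def variance(joint):
    m = expect(joint)
    return expect(joint, lambda x: (x - m) ** 2)


def cond_variance(joint, y):
    m = cond_expect(joint, y)
    return cond_expect(joint, y, lambda x: (x - m) ** 2)


if __name__ == "__main__":
    py = marginal_y(joint)
    # Theorem 1: E(X) = E[E(X|Y)]
    print(expect(joint), sum(py[y] * cond_expect(joint, y) for y in py))
    # Theorem 2: V(X) = E[V(X|Y)] + V[E(X|Y)]
    within = sum(py[y] * cond_variance(joint, y) for y in py)
    between = sum(py[y] * (cond_expect(joint, y) - expect(joint)) ** 2 for y in py)
    print(variance(joint), within + between)
```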

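A.15 Extrema of a Function of Several Variables Connected by Given Relations

For the sake of definiteness, let us think of a function of three variables, f(x, y, z), for which we want a maximum or minimum subject to the given relation φ(x, y, z) = 0.

This relation may be regarded as expressing z in terms of x and y. The conditions for an extremum of f are

(A.15.1)  $\frac{\partial f}{\partial x} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial x} = 0, \qquad \frac{\partial f}{\partial y} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial y} = 0$

and the partial derivatives $\partial z/\partial x$, $\partial z/\partial y$ are connected by the equations

(A.15.2)  $\frac{\partial \phi}{\partial x} + \frac{\partial \phi}{\partial z}\frac{\partial z}{\partial x} = 0, \qquad \frac{\partial \phi}{\partial y} + \frac{\partial \phi}{\partial z}\frac{\partial z}{\partial y} = 0$

Eliminating these partial derivatives from Eqs. (1) and (2), we get

(A.15.3)  $\frac{\partial f}{\partial x}\frac{\partial \phi}{\partial z} - \frac{\partial f}{\partial z}\frac{\partial \phi}{\partial x} = 0, \qquad \frac{\partial f}{\partial y}\frac{\partial \phi}{\partial z} - \frac{\partial f}{\partial z}\frac{\partial \phi}{\partial y} = 0$

which may be written as Jacobians (see § A.4)

(A.15.4)  $J\!\left(\frac{f,\,\phi}{x,\,z}\right) = 0, \qquad J\!\left(\frac{f,\,\phi}{y,\,z}\right) = 0$

These, together with φ = 0, determine the values of x, y and z corresponding to extrema. Equations (3) express the conditions that we can find a function λ to satisfy the three equations

(A.15.5)  $\frac{\partial f}{\partial x} + \lambda\frac{\partial \phi}{\partial x} = 0, \qquad \frac{\partial f}{\partial y} + \lambda\frac{\partial \phi}{\partial y} = 0, \qquad \frac{\partial f}{\partial z} + \lambda\frac{\partial \phi}{\partial z} = 0$

so that we may replace Eqs. (3) or (4) by the set (5), in which λ is an unknown auxiliary function. This set is given by equating to zero the partial derivatives of the function f + λφ with respect to the variables x, y, z where λ is regarded for the purpose of this differentiation as a constant. The quantity λ is called an undetermined multiplier, and the method is due originally to Lagrange.

The method may be extended to n variables connected by h relations. The extrema of $f(x_1, x_2, \ldots, x_n)$, subject to the conditions $\phi_1 = 0, \phi_2 = 0, \ldots, \phi_h = 0$, are found by equating to zero the n partial derivatives of the function,

$f + \lambda_1\phi_1 + \lambda_2\phi_2 + \cdots + \lambda_h\phi_h$

the undetermined multipliers $\lambda_1, \ldots, \lambda_h$ being regarded as constants. The actual values of these multipliers do not matter.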

A.16 The Multinomial Theorem

The multinomial theorem gives the expansion of $(x_1 + x_2 + \cdots + x_k)^n$, where we suppose that n is a positive integer. Each term in the expansion is formed by picking an $x_i$ from the set $x_1, x_2, \ldots, x_k$, doing this n times and multiplying the results together. If a particular $x_i$ is picked $n_i$ times, each term in the product is of the form $x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k}$, where $\sum_i n_i = n$. The number of terms like this which are identical in value is the number of ways of arranging $n_1$ objects of one kind ($x_1$), $n_2$ of another kind ($x_2$), and so on, and by Theorem 1.11 this number is $n!/(n_1!\,n_2! \cdots n_k!)$. We have, therefore,

(A.16.1)  $(x_1 + x_2 + \cdots + x_k)^n = \sum \frac{n!}{n_1!\,n_2! \cdots n_k!}\, x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k}$

where the sum is taken over all sets of values of $n_1, n_2, \ldots, n_k$ (all non-negative) such that $\sum_i n_i = n$.

As an illustration, if n = 3 and k = 3, the possible sets of values of $(n_1, n_2, n_3)$ are (3, 0, 0), (0, 3, 0), (0, 0, 3), (2, 1, 0), (2, 0, 1), (1, 2, 0), (1, 0, 2), (0, 1, 2), (0, 2, 1) and (1, 1, 1), so that

$(x_1 + x_2 + x_3)^3 = x_1^3 + x_2^3 + x_3^3 + 3(x_1^2 x_2 + x_1^2 x_3 + x_1 x_2^2 + x_1 x_3^2 + x_2 x_3^2 + x_2^2 x_3) + 6 x_1 x_2 x_3$

The terms in the sum on the right-hand side of Eq. (1) can be interpreted as the probabilities that a random sample of n objects, drawn from a population which is divided into k classes, will have exactly $n_1$ in the first class, $n_2$ in the second class and so on. It is assumed that, in the population, the probability that an object falls in the ith class is $x_i$, where of course $x_i > 0$ and $\sum_{i=1}^{k} x_i = 1$.

Such a distribution of n objects among k classes is called a multinomial distribution. The binomial distribution is the special case k = 2.
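The counting argument behind the multinomial coefficient can be reproduced by direct enumeration: pick one of the k symbols n times, and tally how often each pattern of exponents arises. The sketch below does this for the illustration with n = 3 and k = 3 and compares the tallies with n!/(n₁! n₂! ... n_k!); the function names are assumptions.

```python
from collections import Counter
from itertools import product
from math import factorial


def multinomial_coefficient(counts):
    # n! / (n_1! n_2! ... n_k!) with n = sum of the counts.
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result


def expansion_by_enumeration(k, n):
    # Choose one of k symbols n times and tally each exponent pattern.
    tally = Counter()
    for picks in product(range(k), repeat=n):
        exponents = tuple(picks.count(i) for i in range(k))
        tally[exponents] += 1
    return tally


if __name__ == "__main__":
    tally = expansion_by_enumeration(3, 3)
    for exponents, count in sorted(tally.items(), reverse=True):
        print(exponents, count, multinomial_coefficient(exponents))
```

The enumeration visits k^n selections, so it is practical only for small n and k, but it makes the origin of the coefficients 1, 3 and 6 in the expansion explicit.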

A.17 The Multinomial Distribution and Chi-Square

The probability of the particular multinomial distribution with $f_i$ objects in the ith class (i = 1, 2, ..., k) is, from (A.16.1),

(A.17.1)  $p = \frac{N!}{f_1!\,f_2! \cdots f_k!}\, \pi_1^{f_1} \pi_2^{f_2} \cdots \pi_k^{f_k}$

where $\sum f_i = N$ and where $\pi_i$ is the probability that any object in the population belongs to the ith class. Therefore

(A.17.2)  $\log p = \log N! - \sum_i \log f_i! + \sum_i f_i \log \pi_i$

Using Stirling's approximation (Appendix A.2), on the assumption that the $f_i$ are all sufficiently large, we can replace $\log f_i!$ by $(f_i + \tfrac12)\log f_i - f_i + \tfrac12\log(2\pi)$, and thus obtain

(A.17.3)  $\log p = (N + \tfrac12)\log N - N + \tfrac12\log(2\pi) - \sum (f_i + \tfrac12)\log f_i + \sum f_i - \frac{k}{2}\log(2\pi) + \sum f_i \log \pi_i$

$\qquad = \tfrac12\log N + \sum f_i \log\!\left(\frac{N\pi_i}{f_i}\right) - \frac{k-1}{2}\log(2\pi) - \tfrac12\sum \log f_i$

since the term N log N may be written $\sum f_i \log N$. We therefore have

(A.17.4)  $\log p = \log C + \sum (f_i + \tfrac12)\log\!\left(\frac{N\pi_i}{f_i}\right)$

where

(A.17.5)  $\log C = -\frac{k-1}{2}\log(2\pi N) - \tfrac12\sum \log \pi_i$

If we put $\phi_i = N\pi_i$ (which is the expected number in the ith class in a sample of size N) and let

(A.17.6)  $z_i = (f_i - \phi_i)/\phi_i^{1/2}$

we have

(A.17.7)  $\log p - \log C = \sum (f_i + \tfrac12)\log\!\left(\frac{\phi_i}{f_i}\right)$

$\qquad = -\sum (f_i + \tfrac12)\log(1 + \phi_i^{-1/2} z_i)$

$\qquad = -\sum (\phi_i + \phi_i^{1/2} z_i + \tfrac12)\log(1 + \phi_i^{-1/2} z_i)$

Now if the $\phi_i$ are fairly large, the differences between the $f_i$ and the $\phi_i$ will usually be small compared with the $\phi_i$ themselves, so that we would expect $\phi_i^{-1/2} z_i$ to be less than 1. If therefore we expand the logarithm in Eq. (7) in a series, namely,

$\log(1 + \phi_i^{-1/2} z_i) = \phi_i^{-1/2} z_i - \tfrac12\phi_i^{-1} z_i^2 + \cdots$

and multiply by the preceding factor, we obtain

(A.17.8)  $\log p - \log C = -\sum_i \left[\tfrac12 z_i^2 + \phi_i^{1/2} z_i + O(\phi_i^{-1/2})\right]$

But $\sum_i z_i \phi_i^{1/2} = \sum_i (f_i - \phi_i) = 0$, so that, to order $\phi_i^{-1/2}$,

$\log p - \log C \approx -\tfrac12 \sum z_i^2$

or, equivalently,

(A.17.9)  $p = C e^{-\frac12\sum z_i^2}, \qquad C = (2\pi N)^{-(k-1)/2} \prod_i \pi_i^{-1/2}$

This means that the $z_i$ are approximately normally distributed about zero with unit variance. They are not, however, independent, since they are subject to a linear constraint

(A.17.10)  $\sum_i z_i \phi_i^{1/2} = 0$

We have seen in § 4.6 that the sum of squares of n independent standard normal variates is distributed as χ² with n degrees of freedom. If we make an orthogonal transformation (§ A.10) to new variables $y_j$, where $y_j = \sum_i c_{ji} z_i$, and let $c_{ki}$ be proportional to $\phi_i^{1/2}$ (that is, $c_{ki} = (\phi_i/N)^{1/2} = \pi_i^{1/2}$, so that $\sum_i c_{ki}^2 = 1$), the variable $y_k$ will be zero by Eq. (10) and we shall have k − 1 independent variates, all normally distributed about zero with unit variance. Also $\sum_{i=1}^{k} z_i^2 = \sum_{j=1}^{k-1} y_j^2$, so that $\sum z_i^2$ is the sum of squares of k − 1 independent standard normal variates and is therefore distributed as χ² with k − 1 d.f.

It may be shown that if the variates are connected by l linear constraints, $\sum z_i^2$ is approximately χ² with k − l d.f. This sum, by Eq. (6), may be written

$\sum_i \frac{(f_i - \phi_i)^2}{\phi_i}$

and this is the ordinary definition of χ² for a variate grouped into classes.

A.18 Matrix Algebra

A set of mn elements arranged in a rectangular array of m rows and n columns is called a matrix of order m by n. When m = n the matrix is said to be square. The elements may be real or complex numbers, but in statistical applications they are usually real. The whole array is often denoted by a single letter, or by a typical element enclosed in parentheses. Thus

$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} = (a_{ij})$


The  matrix  is  thought  of  as  a  single  mathematical  entity,  which  is  subject  to 
algebraic  operations.  These  operations  must,  of  course,  be  defined. 

A  matrix  with  a  single  row  or  a  single  column  is  called  a  vector. 

A  matrix  is  zero  if  and  only  if  all  its  elements  are  zero. 

Two  matrices  are  conformable  if  they  each  have  the  same  number  of  rows 
and  also  the  same  number  of  columns. 

Equality.  Two  conformable  matrices  are  equal  if  and  only  if  each  element  in 
one  is  equal  to  the  corresponding  element  in  the  other. 

Addition.  The  sum  of  two  conformable  matrices  A  and  B  is  a  conformable 
matrix  C  such  that 


(A.18.1)  $c_{ij} = a_{ij} + b_{ij}$


That  is,  elements  in  corresponding  positions  in  the  two  matrices  are  simply 
added.  Subtraction  is  similarly  defined.  The  matrix  denoted  by  —  A  has  all  its 
elements  opposite  in  sign  to  those  of  A. 

Multiplication of a matrix and a number. If λ is any number (real or complex) the product λA is defined by

(A.18.2)  $\lambda A = (\lambda a_{ij})$

Each element in A is multiplied by λ.

Multiplication.  If  A  is  an  m  by  p  matrix  and  B  a  p  by  n  matrix,  the  product 
AB  is  an  m  by  n  matrix  defined  by 


(A.18.3)  $AB = \left( \sum_{k=1}^{p} a_{ik} b_{kj} \right)$

For this product to exist it is necessary that the number of rows in B should be equal to the number of columns in A. Each row in A is multiplied, term by term, with each column in B.


Example 1

$\begin{bmatrix} 3 & 1 & 4 \\ 2 & -4 & 5 \end{bmatrix} \begin{bmatrix} 1 & 5 \\ 8 & 2 \\ 3 & -1 \end{bmatrix} = \begin{bmatrix} 23 & 13 \\ -15 & -3 \end{bmatrix}$

Here 3(1) + 1(8) + 4(3) = 23, and so for the other elements of the product.

Matrix addition satisfies the ordinary commutative and associative laws of elementary algebra, but this is not true of multiplication. If the matrices in Example 1 above are multiplied in the reverse order we get an entirely different product:

$\begin{bmatrix} 1 & 5 \\ 8 & 2 \\ 3 & -1 \end{bmatrix} \begin{bmatrix} 3 & 1 & 4 \\ 2 & -4 & 5 \end{bmatrix} = \begin{bmatrix} 13 & -19 & 29 \\ 28 & 0 & 42 \\ 7 & 7 & 7 \end{bmatrix}$

It  is  not  therefore  true  in  general  that  matrix  multiplication  is  commutative.  It 
is necessary to distinguish between "pre-multiplication" and "post-multiplication," or to speak of multiplying "on the left" or "on the right." If we
multiply  B  by  A  on  the  left,  we  get  AB,  and  if  on  the  right,  BA. 

Another  law  of  elementary  algebra  that  does  not  hold  with  matrices  is  the 
product  law.  This  asserts  that  if  ab  =  0,  then  either  a  or  b  must  be  zero.  But  AB 
can  be  a  zero  matrix  without  either  A  or  B  being  zero. 

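Example 2

$\begin{bmatrix} 2 & -1 \\ 10 & -5 \end{bmatrix} \begin{bmatrix} 1 & 3 \\ 2 & 6 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$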
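Both departures from elementary algebra, the failure of the commutative law and the failure of the product law, show up with the two matrices of Example 2. The sketch below implements the row-by-column rule of Eq. (A.18.3) on nested lists; `matmul` is a hypothetical helper written for this illustration rather than a library routine.

```python
def matmul(A, B):
    # Product of an m-by-p matrix A and a p-by-n matrix B (lists of rows).
    inner = len(B)
    if any(len(row) != inner for row in A):
        raise ValueError("columns of A must equal rows of B")
    cols = len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(A))
    ]


if __name__ == "__main__":
    P = [[2, -1], [10, -5]]
    Q = [[1, 3], [2, 6]]
    print(matmul(P, Q))  # zero matrix, although neither factor is zero
    print(matmul(Q, P))  # the reverse product is not zero
```

Multiplying in the reverse order gives a non-zero matrix, so for these factors PQ and QP differ even though PQ vanishes.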

On  the  other  hand,  the  associative  and  distributive  laws  of  algebra  hold  for 
matrices,  provided  of  course  that  the  matrices  are  properly  conformable  for  the 
operations  suggested.   With  this  understanding  we  can  write 


(A.18.4)
$(AB)C = A(BC) = ABC$
$A(B + C) = AB + AC$
$(A + B)C = AC + BC$


A.19  Transposition 

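If the successive columns of a matrix A are written as successive rows of a new matrix A', then A' is called the transpose of A. If A is an m by n matrix, A' is an n by m matrix, and $a'_{ij} = a_{ji}$.

Example 3

$A = \begin{bmatrix} 3 & 6 & 2 \\ 2 & 1 & 0 \\ 5 & 9 & 7 \\ 1 & 0 & 6 \end{bmatrix}, \qquad A' = \begin{bmatrix} 3 & 2 & 5 & 1 \\ 6 & 1 & 9 & 0 \\ 2 & 0 & 7 & 6 \end{bmatrix}$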

The  transpose  of  a  row  vector  is  a  column  vector,  and  conversely. 


A square matrix is symmetric if A' = A and skew-symmetric if A' = −A. Thus $\begin{bmatrix} a & h \\ h & b \end{bmatrix}$ is symmetric and $\begin{bmatrix} 0 & h \\ -h & 0 \end{bmatrix}$ is skew-symmetric. In any skew-symmetric matrix all the elements along the principal diagonal must be zero.

The following theorem is sometimes referred to as the reversal rule:

(A.19.1)  $(AB)' = B'A'$

Proof: If $c'_{ij}$ is the element in the ith row and jth column of (AB)',

$c'_{ij} = c_{ji} = \sum_k a_{jk} b_{ki} = \sum_k a'_{kj} b'_{ik} = \sum_k b'_{ik} a'_{kj}$

which is the (ij)th element of B'A'. Similarly, (ABC)' = C'B'A', etc.

A  square  matrix  in  which  all  the  elements  except  those  in  the  principal 
diagonal  are  zero  is  called  a  diagonal  matrix.  A  diagonal  matrix  is  symmetric, 
and  commutes  with  any  other  diagonal  matrix  having  the  same  number  of  rows. 


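Example 4

$\begin{bmatrix} 3 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 2 \end{bmatrix} \cdot \begin{bmatrix} -1 & 0 & 0 \\ 0 & -4 & 0 \\ 0 & 0 & 3 \end{bmatrix} = \begin{bmatrix} -3 & 0 & 0 \\ 0 & -4 & 0 \\ 0 & 0 & 6 \end{bmatrix} = \begin{bmatrix} -1 & 0 & 0 \\ 0 & -4 & 0 \\ 0 & 0 & 3 \end{bmatrix} \cdot \begin{bmatrix} 3 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 2 \end{bmatrix}$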

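Multiplication of a matrix by a number λ is equivalent to multiplying (either on the left or on the right) by a diagonal matrix with each non-zero element equal to λ.

Example 5

$\begin{bmatrix} \lambda & 0 & 0 \\ 0 & \lambda & 0 \\ 0 & 0 & \lambda \end{bmatrix} \cdot \begin{bmatrix} 2 & 1 & 7 \\ 9 & 3 & -2 \\ 1 & 4 & -5 \end{bmatrix} = \begin{bmatrix} 2\lambda & \lambda & 7\lambda \\ 9\lambda & 3\lambda & -2\lambda \\ \lambda & 4\lambda & -5\lambda \end{bmatrix} = \lambda \begin{bmatrix} 2 & 1 & 7 \\ 9 & 3 & -2 \\ 1 & 4 & -5 \end{bmatrix}$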

A diagonal matrix with each diagonal element 1 is called a unit matrix. Multiplication by a unit matrix, on the right or on the left, leaves any other matrix unchanged.

(A.19.2)  $AI = IA = A$

where I is a unit matrix with the proper number of rows and columns.
A.20 The Determinant of a Matrix

If A is a square n × n matrix, the determinant of A, denoted by d(A), is a polynomial of the nth degree in the elements of A. The terms of the polynomial are obtained by multiplying together all possible sets of n elements taken one from each row and one from each column, and giving the products alternate plus and minus signs. It is assumed that the reader is familiar with the elementary properties of determinants as found in most textbooks of college algebra, but we recall briefly a few of these properties.

The usual notation for a determinant is

(A.20.1)  $d(A) = \begin{vmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{vmatrix} = |a_{ij}|$

The determinant of n − 1 rows obtained by omitting the ith row and jth column of d(A) is called the minor of $a_{ij}$ and will be denoted by $d(A_{ij})$. The signed minor

(A.20.2)  $C_{ij} = (-1)^{i+j}\, d(A_{ij})$

is called the cofactor of $a_{ij}$. The formulas for the development of d(A), in terms of the ith row and in terms of the jth column, are

(A.20.3)  $d(A) = \sum_j a_{ij} C_{ij} = \sum_i a_{ij} C_{ij}$

If in these formulas we replace the cofactors of the ith row (or the jth column) by the cofactors of a different row or column (what Aitken has called alien cofactors) the expressions reduce to zero. That is,

(A.20.4)  $\sum_j a_{ij} C_{kj} = 0, \quad i \ne k$
$\qquad \sum_i a_{ij} C_{ik} = 0, \quad j \ne k$

A convenient symbol for expressing such pairs of relations as Eqs. (3) and (4) is the Kronecker delta, $\delta_{jk}$, defined as equal to 1 when j = k, and equal to 0 when j ≠ k. Equations (3) and (4), with this notation, may be written:

(A.20.5)  $\sum_j a_{ij} C_{kj} = \delta_{ik}\, d(A)$
$\qquad \sum_i a_{ij} C_{ik} = \delta_{jk}\, d(A)$
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Even  if  a  matrix  is  not  square,  we  can  form  determinants  by  crossing  out 
rows  and/or  columns  to  leave  square  arrays.  The  determinants  of  these  arrays 
are  all  determinants  of  the  matrix.  The  rank  of  a  matrix  is  the  order  of  the 
largest  non-zero  determinant  that  can  be  formed  in  this  way.  A  square  n  by  n 
matrix  is  called  singular  if  the  determinant  of  the  whole  matrix  is  zero.  Its  rank 
is  then  of  course  less  than  n. 

If  A  and  B  are  square  matrices  of  the  same  size,  and  if  C  =  AB,  then 

d(C) = d(A) · d(B)

Although  the  matrices  AB  and  BA  are  in  general  different  they  have  the  same 
determinant. 
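The cofactor development of Eq. (A.20.3), the alien-cofactor relation of Eq. (A.20.5) and the product rule for determinants can be checked numerically. The following is only a minimal sketch in plain Python: the function names and the two small integer matrices are illustrative assumptions, not part of the text, and the recursive development is practical only for small n.

```python
def minor(a, i, j):
    """Return the array left after deleting row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]


def det(a):
    """Determinant d(A) by development along the first row."""
    if len(a) == 1:
        return a[0][0]
    return sum(a[0][j] * cofactor(a, 0, j) for j in range(len(a)))


def cofactor(a, i, j):
    """Signed minor C_ij = (-1)^(i+j) d(A_ij)."""
    return (-1) ** (i + j) * det(minor(a, i, j))


def row_development(a, i, k):
    """Sum over j of a_ij * C_kj, which should equal delta_ik * d(A)."""
    return sum(a[i][j] * cofactor(a, k, j) for j in range(len(a)))


def matmul(a, b):
    """Ordinary matrix product of two square arrays of the same size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [[2, 1, 7], [9, 3, -2], [1, 4, -5]]   # illustrative matrix (assumption)
B = [[1, 0, 2], [0, 1, 1], [3, 0, 1]]     # illustrative matrix (assumption)

own = row_development(A, 0, 0)     # cofactors of the same row: gives d(A)
alien = row_development(A, 0, 1)   # alien cofactors: gives zero
same_det = det(matmul(A, B)) == det(matmul(B, A)) == det(A) * det(B)
```

Integer entries keep every comparison exact, so the equalities hold without any rounding tolerance.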

A.21  The  Inverse  of  a  Matrix 

If an n by n matrix A is non-singular, there exists a unique n by n matrix denoted by $A^{-1}$, such that

(A.21.1)  $AA^{-1} = A^{-1}A = I$

This matrix $A^{-1}$ is called the inverse of A.

The transpose of the matrix of cofactors of the elements of A is called the adjoint of A, denoted by adj A. That is

(A.21.2)  $\operatorname{adj} A = (C_{ij})' = (C_{ji})$

It follows that

(A.21.3)  $A \cdot \operatorname{adj} A = \left(\sum_k a_{ik} C_{jk}\right) = (\delta_{ij}\, d(A))$

But since d(A) is a number and $\delta_{ij}$ is the (i, j)th element of the unit matrix I, we can express the matrix on the right of Eq. (3) as d(A)I. We have, then,

(A.21.4)  $A \cdot \dfrac{\operatorname{adj} A}{d(A)} = I$

so that the inverse of A is the adjoint of A divided by its determinant (supposed non-zero). We can therefore define, for non-singular square matrices, the operations of "pre-division" and "post-division," symbolised by $A^{-1}B$ and $BA^{-1}$.

The  reversal  rule  applies  to  inversion,  namely, 

(A.21.5)  $(AB)^{-1} = B^{-1}A^{-1}$

To prove this, we note that

\[
(AB)(AB)^{-1} = I = AA^{-1} = A(BB^{-1})A^{-1} = (AB)(B^{-1}A^{-1})
\]

where we have used the associative law for matrix multiplication.

The  operations  of  transposition  and  inversion  are  commutative.  That  is, 

(A.21.6)  $(A^{-1})' = (A')^{-1}$

Proof: $A'(A^{-1})' = (A^{-1}A)' = I' = I$, which shows that $(A^{-1})'$ is the inverse of $A'$.

The elements of $A^{-1}$ are conveniently written as $a^{ij}$, using superscripts instead of subscripts. From Eqs. (2) and (4) it follows that

(A.21.7)  $a^{ij} = \dfrac{C_{ji}}{d(A)}$

Given the set of normal equations in § 12.1, namely, Ab = g, we can solve for b by multiplying both sides on the left by $A^{-1}$. This gives $b = A^{-1}g$, or, using the notation of Eq. (7),

(A.21.8)  $b_i = \sum_j a^{ij} g_j = \left(\sum_j C_{ji}\, g_j\right) \Big/ d(A)$

Since $\sum_j C_{ji}\, g_j$ is the expanded form of d(A) in which the ith column has been replaced by a column of g's, this equation is a statement of Cramer's rule (§ 12.3).
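Equations (A.21.7) and (A.21.8) translate directly into a short program. The sketch below is an illustration only: it assumes integer (or rational) entries so that exact fractions can be used, the function names are hypothetical, and the 2 × 2 system is an invented example rather than one from the text.

```python
from fractions import Fraction


def minor(a, i, j):
    """Array left after deleting row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]


def det(a):
    """Determinant by development along the first row."""
    if len(a) == 1:
        return a[0][0]
    return sum(a[0][j] * cofactor(a, 0, j) for j in range(len(a)))


def cofactor(a, i, j):
    """Signed minor C_ij."""
    return (-1) ** (i + j) * det(minor(a, i, j))


def inverse(a):
    """Inverse as adjoint over determinant: element (i, j) is C_ji / d(A)."""
    d = det(a)
    if d == 0:
        raise ValueError("matrix is singular")
    n = len(a)
    return [[Fraction(cofactor(a, j, i), d) for j in range(n)]
            for i in range(n)]


def solve_by_inverse(a, g):
    """b = A^{-1} g, i.e. b_i = sum_j a^{ij} g_j."""
    inv = inverse(a)
    return [sum(inv[i][j] * g[j] for j in range(len(g)))
            for i in range(len(g))]


def cramer(a, g, i):
    """b_i as d(A with column i replaced by g) divided by d(A)."""
    replaced = [row[:i] + [g[r]] + row[i + 1:] for r, row in enumerate(a)]
    return Fraction(det(replaced), det(a))


A = [[2, 1], [1, 3]]   # invented coefficients
g = [3, 5]             # invented right-hand side
b = solve_by_inverse(A, g)
b_cramer = [cramer(A, g, i) for i in range(len(g))]
```

Both routes give the same b, which is the content of the remark that Eq. (A.21.8) is a statement of Cramer's rule.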

A.22  Orthogonal  Matrices 

A  non-singular  square  matrix  A  is  orthogonal  if  its  transpose  is  equal  to  its 
inverse,  that  is,  if 

(A.22.1)  AA'  =  I 

The matrix of the coefficients of an orthogonal transformation (§ A.10) is orthogonal. Thus the transformation expressed by Eqs. (A.10.1) and (A.10.2) can be written in matrix form

(A.22.2)  $Y = CX, \qquad CC' = I$

where Y and X are column vectors. An example is the transformation

\[
\begin{aligned}
Y_1 &= X_1 \cos\theta - X_2 \sin\theta \\
Y_2 &= X_1 \sin\theta + X_2 \cos\theta
\end{aligned}
\]

which corresponds to a clockwise rotation of the axes about the origin through an angle θ. The matrix

\[
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\]

is orthogonal.
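That the rotation matrix satisfies Eq. (A.22.1) can be confirmed numerically. This is a small sketch with hypothetical function names and an arbitrarily chosen angle; the comparison uses a tolerance because the sines and cosines are floating-point values.

```python
import math


def rotation(theta):
    """Coefficient matrix C of the rotation through the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def transpose(m):
    """Interchange rows and columns."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Ordinary matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


C = rotation(0.3)                  # angle chosen arbitrarily
product = matmul(C, transpose(C))  # should be the unit matrix
is_orthogonal = all(
    abs(product[i][j] - (1.0 if i == j else 0.0)) < 1e-12
    for i in range(2) for j in range(2)
)
```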


A.23  Calculation  of  the  Inverse  of  a  Matrix 

The solution of a set of normal equations (§ 12.1) is immediately expressible in terms of the inverse matrix of the coefficients. Moreover, some of the elements of the inverse matrix are required for testing the significance of the partial regression coefficients $b_i$. It is therefore sometimes worth while to invert a matrix. In practice the matrix is usually symmetric.

With  a  square  matrix  of  three  rows  and  columns,  it  is  a  simple  matter  to 
compute  the  cofactors  and  the  whole  determinant,  and  thus  obtain  the  inverted 
matrix  directly  from  its  definition  (§  A.21).  However,  for  larger  matrices  a  more 
compact  and  systematic  method  is  desirable.  The  following  method  is  known  as 
Jordan's. 

If  we  can  find  a  matrix  J  such  that 

(A.23.1)  $J(A, I) = (I, J)$

then J is $A^{-1}$. Here I is a unit matrix of the same number of rows as A, placed alongside A. The method consists in multiplying the augmented matrix (A, I) by successive matrices of the form $I + J_i$, where $J_i$ differs from a zero matrix only in the ith column. By suitably choosing the non-zero elements of $J_i$ we can build up a unit matrix I on the left of the product matrix and the rest of the product matrix is then the required inverse, J. For a 4 × 4 matrix we should have

\[
\begin{aligned}
(I + J_1)(A, I) &= (A_1, K_1) \\
(I + J_2)(A_1, K_1) &= (A_2, K_2) \\
(I + J_3)(A_2, K_2) &= (A_3, K_3) \\
(I + J_4)(A_3, K_3) &= (I, A^{-1})
\end{aligned}
\]

The first column of $I + J_1$ is chosen so that the first column of $A_1$ has elements (read downwards) 1, 0, 0, 0, and so need not be recorded. This first column of $I + J_1$ becomes the first column of $K_1$ and is recorded there. The second column of $I + J_2$ is chosen so that the second column of $A_2$ becomes 0, 1, 0, 0, and the second column of $K_2$ becomes this second column of $I + J_2$, and so on. In this way the unit matrix I is built up in four steps, but need not be recorded.

If the elements of the first column of A (read downwards) are $a_1, a_2, a_3, a_4$, those of the first column of $I + J_1$ are $1/a_1,\ -a_2/a_1,\ -a_3/a_1,\ -a_4/a_1$. If the elements of the second column of $A_1$ are $a'_1, a'_2, a'_3, a'_4$, then those of the second column of $I + J_2$ are $-a'_1/a'_2,\ 1/a'_2,\ -a'_3/a'_2,\ -a'_4/a'_2$. The other products are formed similarly.

Example 6 The steps in the calculation of the inverse of a 4 × 4 matrix are given below. At each step the matrix $I + J_i$ has been shown in full for the sake of clearness, but it is not actually necessary to set it down. Unit columns in the augmented matrix have, however, been omitted. The pivotal elements in the key columns of $A$, $A_1$, $A_2$, and $A_3$ are marked. Thus $a_1 = 1.0$, $a'_2 = 0.84$, $a''_3 = 0.7381$, $a'''_4 = 0.5903$. These are the elements whose reciprocals are used in forming the corresponding columns of $I + J_i$.

In the arrays below a dot stands for an omitted unit column and the pivotal element is enclosed in brackets.

    I + J_1                                  A   (I omitted)

     1.0      0        0        0           [1.0]    0.4      0.5      0.6
    -0.4      1        0        0            0.4     1.0      0.3      0.4
    -0.5      0        1        0            0.5     0.3      1.0      0.2
    -0.6      0        0        1            0.6     0.4      0.2      1.0

    I + J_2                                  (A_1, K_1)

     1     -0.4762     0        0            .     0.40     0.50     0.60     1.00
     0      1.1905     0        0            .    [0.84]    0.10     0.16    -0.40
     0     -0.1190     1        0            .     0.10     0.75    -0.10    -0.50
     0     -0.1905     0        1            .     0.16    -0.10     0.64    -0.60

    I + J_3                                  (A_2, K_2)

     1      0       -0.6129     0            .  .   0.4524    0.5238    1.1905   -0.4762
     0      1       -0.1612     0            .  .   0.1190    0.1905   -0.4762    1.1905
     0      0        1.3548     0            .  .  [0.7381]  -0.1190   -0.4524   -0.1190
     0      0        0.1612     1            .  .  -0.1190    0.6095   -0.5238   -0.1905

    I + J_4                                  (A_3, K_3)

     1      0        0       -1.0108         .  .  .   0.5967    1.4678   -0.4033   -0.6129
     0      1        0       -0.3552         .  .  .   0.2097   -0.4033    1.2097   -0.1612
     0      0        1        0.2731         .  .  .  -0.1612   -0.6129   -0.1612    1.3548
     0      0        0        1.6940         .  .  .  [0.5903]  -0.5967   -0.2097    0.1612

    (I, A^{-1}), with I omitted

     2.0709   -0.1913   -0.7758   -1.0108
    -0.1913    1.2842   -0.2185   -0.3552
    -0.7758   -0.2185    1.3988    0.2731
    -1.0108   -0.3552    0.2731    1.6940

Since the original matrix A was symmetric, the inverse $A^{-1}$ is also symmetric. It is therefore unnecessary to compute separately the terms of $A^{-1}$ below the main diagonal, except as checks on the arithmetic.
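Jordan's method as described in this section can be sketched in a few lines. The version below is an illustration under stated assumptions: the function name is hypothetical, the full augmented matrix is carried along (unit columns included) for simplicity, and no rows are interchanged, so every pivotal element must be non-zero, as it is for the symmetric matrix of Example 6.

```python
def jordan_inverse(a):
    """Invert a square matrix by successive multiplication by I + J_i."""
    n = len(a)
    # Augmented matrix (A, I).
    m = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for i in range(n):
        pivot = m[i][i]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivotal element; method as sketched fails")
        # ith column of I + J_i: 1/pivot in row i, -a_k/pivot in every other row.
        col = [-m[k][i] / pivot for k in range(n)]
        col[i] = 1.0 / pivot
        pivot_row = m[i][:]
        for k in range(n):
            if k == i:
                m[k] = [col[i] * x for x in pivot_row]
            else:
                m[k] = [x + col[k] * y for x, y in zip(m[k], pivot_row)]
    # The left half is now the unit matrix; the right half is the inverse.
    return [row[n:] for row in m]


A = [[1.0, 0.4, 0.5, 0.6],
     [0.4, 1.0, 0.3, 0.4],
     [0.5, 0.3, 1.0, 0.2],
     [0.6, 0.4, 0.2, 1.0]]   # the matrix of Example 6
A_inv = jordan_inverse(A)
```

Because the program carries full floating-point precision, its results can differ in the fourth decimal place from a hand computation rounded at every step.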

A.24  Solution  of  a  Set  of  Normal  Equations  by  the  Square  Root  Method 

The  given  equation,  Ab  =  g,  can  be  solved  if  we  can  find  a  triangular  matrix 
S  (that  is,  a  square  matrix  with  all  the  elements  below,  or  above,  the  principal 
diagonal  equal  to  zero)  such  that 

(A.24.1)  S'S  =  A 

If  so,  we  have 

S'(Sb)  =  g 

which  is  equivalent  to  the  two  matrix  equations 

(A.24.2)  Sb  =  k,  S'k  =  g 

Since  S  and  S'  are  triangular,  these  equations  are  comparatively  easy  to  solve. 
The  first  step  is  to  find  the  elements  of  S  by  solving  the  equations,  equivalent 
to  Eq.  (1), 

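(A.24.3)
\[
\begin{aligned}
S_{11}^2 &= a_{11} \\
S_{11} S_{12} &= a_{12} \\
&\ \ \vdots \\
S_{12}^2 + S_{22}^2 &= a_{22} \\
S_{12} S_{13} + S_{22} S_{23} &= a_{23} \\
&\ \ \vdots \\
S_{1p}^2 + S_{2p}^2 + \cdots + S_{pp}^2 &= a_{pp}
\end{aligned}
\]

These equations are solved in order, beginning with the first.

The second step is to find the elements of k by solving S'k = g, which, when written out, is:

(A.24.4)
\[
\begin{aligned}
S_{11} k_1 &= g_1 \\
S_{12} k_1 + S_{22} k_2 &= g_2 \\
&\ \ \vdots \\
S_{1p} k_1 + S_{2p} k_2 + \cdots + S_{pp} k_p &= g_p
\end{aligned}
\]

The final step is to solve the set equivalent to Sb = k, namely,

(A.24.5)
\[
\begin{aligned}
S_{11} b_1 + S_{12} b_2 + \cdots + S_{1p} b_p &= k_1 \\
S_{22} b_2 + \cdots + S_{2p} b_p &= k_2 \\
&\ \ \vdots \\
S_{p-1,p-1} b_{p-1} + S_{p-1,p} b_p &= k_{p-1} \\
S_{pp} b_p &= k_p
\end{aligned}
\]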


These are solved backwards, beginning with the last equation, obtaining first $b_p$, then $b_{p-1}$, and so on. The process may be illustrated by the following example:

\[
\begin{aligned}
58 b_1 + 23 b_2 + 43 b_3 &= 160 \\
23 b_1 + 78 b_2 + 168 b_3 &= 1910 \\
43 b_1 + 168 b_2 + 1096 b_3 &= 240
\end{aligned}
\]

The various steps, including a check column, may be set out in a table:

Table A.24

                     A                      g        g (check)
         58        23        43           160          284
         23        78       168          1910         2179
         43       168      1096           240         1547

                     S                      k        k (check)
       7.616     3.020     5.646        21.008       37.290
                 8.299    18.189       222.503      248.991
                          27.079      -144.972     -117.893

    b           -8.557    38.545        -5.354
    b (check)   -7.557    39.545        -4.354

Step 1

\[
\begin{aligned}
S_{11} &= (58)^{1/2} = 7.616 \\
S_{12} &= \frac{23}{7.616} = 3.020 \\
S_{13} &= \frac{43}{7.616} = 5.646 \\
S_{22} &= [78 - (3.020)^2]^{1/2} = (68.88)^{1/2} = 8.299 \\
S_{23} &= \frac{168 - (3.020)(5.646)}{8.299} = 18.189 \\
S_{33} &= [1096 - (5.646)^2 - (18.189)^2]^{1/2} = 27.079
\end{aligned}
\]

Step 2

\[
\begin{aligned}
k_1 &= \frac{160}{7.616} = 21.008 \\
k_2 &= \frac{1910 - (3.020)(21.008)}{8.299} = 222.503 \\
k_3 &= \frac{240 - (5.646)(21.008) - (18.189)(222.503)}{27.079} = -144.972
\end{aligned}
\]

Step 3

\[
\begin{aligned}
b_3 &= \frac{-144.972}{27.079} = -5.354 \\
b_2 &= \frac{222.503 - (18.189)(-5.354)}{8.299} = 38.545 \\
b_1 &= \frac{21.008 - (3.020)(38.545) - (5.646)(-5.354)}{7.616} = -8.557
\end{aligned}
\]
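The three steps of the square root method can be collected into one routine. The following is a sketch rather than a definitive implementation: the function name is hypothetical, and it assumes A is symmetric with every quantity under a square root positive, so that a real triangular S exists.

```python
import math


def square_root_solve(a, g):
    """Solve Ab = g by finding upper-triangular S with S'S = A."""
    p = len(a)
    s = [[0.0] * p for _ in range(p)]
    # Step 1: the elements of S, row by row, from Eq. (A.24.3).
    for i in range(p):
        for j in range(i, p):
            t = a[i][j] - sum(s[r][i] * s[r][j] for r in range(i))
            s[i][j] = math.sqrt(t) if j == i else t / s[i][i]
    # Step 2: S'k = g, solved in order from the first equation.
    k = [0.0] * p
    for i in range(p):
        k[i] = (g[i] - sum(s[r][i] * k[r] for r in range(i))) / s[i][i]
    # Step 3: Sb = k, solved backwards from the last equation.
    b = [0.0] * p
    for i in reversed(range(p)):
        b[i] = (k[i] - sum(s[i][j] * b[j] for j in range(i + 1, p))) / s[i][i]
    return s, k, b


A = [[58, 23, 43], [23, 78, 168], [43, 168, 1096]]
g = [160, 1910, 240]
S, k, b = square_root_solve(A, g)
```

With full floating-point precision the values of S, k and b agree with the hand computation only to about two decimal places, since the tabulated figures were rounded to three decimals at every stage.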

The check consists in computing a vector $\tilde{g}$ whose elements are the sums of the corresponding rows in A and g. Thus

$\tilde{g}_1 = 58 + 23 + 43 + 160 = 284$

Steps 2 and 3 are repeated with $\tilde{g}$ instead of g, giving new vectors $\tilde{k}$ and $\tilde{b}$. Apart from rounding-off errors in the last decimal place, we should find

$\tilde{k}_j = S_{jj} + S_{j,j+1} + \cdots + S_{jp} + k_j$

and

$\tilde{b}_j = b_j + 1.$

In practice, each row of S in Table A.24 is completed, and the check applied, before the next row is started.

Table B.1  Random Sampling Numbers*

First Thousand

         1-4    5-8   9-12  13-16  17-20  21-24  25-28  29-32  33-36  37-40
     1  23 15  75 48  59 01  83 72  59 93  76 24  97 08  86 95  23 03  67 44
     2  05 54  55 50  43 10  53 74  35 08  90 61  18 37  44 10  96 22  13 43
     3  14 87  16 03  50 32  40 43  62 23  50 05  10 03  22 11  54 38  08 34
     4  38 97  67 49  51 94  05 17  58 53  78 80  59 01  94 32  42 87  16 95
     5  97 31  26 17  18 99  75 53  08 70  94 25  12 58  41 54  88 21  05 13
     6  11 74  26 93  81 44  33 93  08 72  32 79  73 31  18 22  64 70  68 50
     7  43 36  12 88  59 11  01 64  56 23  93 00  90 04  99 43  64 07  40 36
     8  93 80  62 04  78 38  26 80  44 91  55 75  11 89  32 58  47 55  25 71
     9  49 54  01 31  81 08  42 98  41 87  69 53  82 96  61 77  73 80  95 27
    10  36 76  87 26  33 37  94 82  15 69  41 95  96 86  70 45  27 48  38 80
    11  07 09  25 23  92 24  62 71  26 07  06 55  84 53  44 67  33 84  53 20
    12  43 31  00 10  81 44  86 38  03 07  52 55  51 61  48 89  74 29  46 47
    13  61 57  00 63  60 06  17 36  37 75  63 14  89 51  23 35  01 74  69 93
    14  31 35  28 37  99 10  77 91  89 41  31 57  97 64  48 62  58 48  69 19
    15  57 04  88 65  26 27  79 59  36 82  90 52  95 65  46 35  06 53  22 54
    16  09 24  34 42  00 68  72 10  71 37  30 72  97 57  56 09  29 82  76 50
    17  97 95  53 50  18 40  89 48  83 29  52 23  08 25  21 22  53 26  15 87
    18  93 73  25 95  70 43  78 19  88 85  56 67  16 68  26 95  99 64  45 69
    19  72 62  11 12  25 00  92 26  82 64  35 66  65 94  34 71  68 75  18 67
    20  61 02  07 44  18 45  37 12  07 94  95 91  73 78  66 99  53 61  93 78
    21  97 83  98 54  74 33  05 59  17 18  45 47  35 41  44 22  03 42  30 00
    22  89 16  09 71  92 22  23 29  06 37  35 05  54 54  89 88  43 81  63 61
    23  25 96  68 82  20 62  87 17  92 65  02 82  35 28  62 84  91 95  48 83
    24  81 44  33 17  19 05  04 95  48 06  74 69  00 75  67 65  01 71  65 45
    25  11 32  25 49  31 42  36 23  43 86  08 62  49 76  67 42  24 52  32 45

Second Thousand

         1-4    5-8   9-12  13-16  17-20  21-24  25-28  29-32  33-36  37-40
     1  64 75  58 38  85 84  12 22  59 20  17 69  61 56  55 95  04 59  59 47
     2  10 30  25 22  89 77  43 63  44 30  38 11  24 90  67 07  34 82  33 28
     3  71 01  79 84  95 51  30 85  03 74  66 59  10 28  87 53  76 56  91 49
     4  60 01  25 56  05 88  41 03  48 79  79 65  59 01  69 78  80 00  36 66
     5  37 33  09 46  56 49  16 14  28 02  48 27  45 47  55 44  55 36  50 90
     6  47 86  98 70  01 31  59 11  22 73  60 62  61 28  22 34  69 16  12 12
     7  38 04  04 27  37 64  16 78  95 78  39 32  34 93  24 88  43 43  87 06
     8  73 50  83 09  08 83  05 48  00 78  36 66  93 02  95 56  46 04  53 36
     9  32 62  34 64  74 84  06 10  43 24  20 62  83 73  19 32  35 64  39 69
    10  97 59  19 95  49 36  63 03  51 06  62 06  99 29  75 95  32 05  77 34
    11  74 01  23 19  55 59  79 09  69 82  66 22  42 40  15 96  74 90  75 89
    12  56 75  42 64  57 13  35 10  50 14  90 96  63 36  74 69  09 63  34 88
    13  49 80  04 99  08 54  83 12  19 98  08 52  82 63  72 92  92 36  50 26
    14  43 58  48 96  47 24  87 85  66 70  00 22  15 01  93 99  59 16  23 77
    15  16 65  37 96  64 60  32 57  13 01  35 74  28 36  36 73  05 88  72 29
    16  48 50  26 90  55 65  32 25  87 48  31 44  68 02  37 31  25 29  63 67
    17  96 76  55 46  92 36  31 68  62 30  48 29  63 83  52 23  81 66  40 94
    18  38 92  36 15  50 80  35 78  17 84  23 44  41 24  63 33  99 22  81 28
    19  77 95  88 16  94 25  22 50  55 87  51 07  30 10  70 60  21 86  19 61
    20  17 92  82 80  65 25  58 60  87 71  02 64  18 50  64 65  79 64  81 70
    21  94 03  68 59  78 02  31 80  44 99  41 05  41 05  31 87  43 12  15 96
    22  47 46  06 04  79 56  23 04  84 17  14 37  28 51  67 27  55 80  03 68
    23  47 85  65 60  88 51  99 28  24 39  40 64  41 71  70 13  46 31  82 88
    24  57 61  63 46  53 92  29 86  20 18  10 37  57 65  15 62  98 69  07 56
    25  08 30  09 27  04 66  75 26  66 10  57 18  87 91  07 54  22 22  20 13

*Reproduced with the permission of Professor E. S. Pearson from M. G. Kendall and B. Babington Smith, Tables of Random Sampling Numbers (Tracts for Computers, No. 24), Cambridge Univ. Press.


Third Thousand

         1-4    5-8   9-12  13-16  17-20  21-24  25-28  29-32  33-36  37-40
     1  89 22  10 23  62 65  78 77  47 33  51 27  23 02  13 92  44 13  96 51
     2  04 00  59 98  18 63  91 82  90 32  94 01  24 23  63 01  26 11  06 50
     3  98 54  63 80  66 50  85 67  50 45  40 64  52 28  41 53  25 44  41 25
     4  41 71  98 44  01 59  22 60  13 14  54 58  14 03  98 49  98 86  55 79
     5  28 73  37 24  89 00  78 52  58 43  24 61  34 97  97 85  56 78  44 71
     6  65 21  38 39  27 77  76 20  30 86  80 74  22 43  95 68  47 68  37 92
     7  65 55  31 26  78 90  90 69  04 66  43 67  02 62  17 69  90 03  12 05
     8  05 66  86 90  80 73  02 98  57 46  58 33  27 82  31 45  98 69  29 98
     9  39 30  29 97  18 49  75 77  95 19  27 38  77 63  73 47  26 29  16 12
    10  64 59  23 22  54 45  87 92  94 31  38 32  00 59  81 18  06 78  71 37
    11  07 51  34 87  92 47  31 48  36 60  68 90  70 53  36 82  57 99  15 82
    12  86 59  36 85  01 56  63 89  98 00  82 83  93 51  48 56  54 10  72 32
    13  83 73  52 25  99 97  97 78  12 48  36 83  89 95  60 32  41 06  76 14
    14  08 59  52 18  26 54  65 50  82 04  87 99  01 70  33 56  25 80  53 84
    15  41 27  32 71  49 44  29 36  94 58  16 82  86 39  62 15  86 43  54 31
    16  00 47  37 59  08 56  23 81  22 42  72 63  17 63  14 47  25 20  63 47
    17  86 13  15 37  89 81  38 30  78 68  89 13  29 61  82 07  00 98  64 32
    18  33 84  97 83  59 04  40 20  35 86  03 17  68 86  63 08  01 82  25 46
    19  61 87  04 16  57 07  46 80  86 12  98 08  39 73  49 20  77 54  50 91
    20  43 89  86 59  23 25  07 88  61 29  78 49  19 76  53 91  50 08  07 86
    21  29 93  93 91  23 04  54 84  59 85  60 95  20 66  41 28  72 64  64 73
    22  38 50  58 55  55 14  38 85  50 77  18 65  79 48  87 67  83 17  08 19
    23  31 82  43 84  31 67  12 52  55 11  72 04  41 15  62 53  27 98  22 68
    24  91 43  00 37  67 13  56 11  55 97  06 75  09 25  52 02  39 13  87 53
    25  38 63  56 89  76 25  49 89  75 26  96 45  80 38  05 04  11 66  35 14

Fourth Thousand

         1-4    5-8   9-12  13-16  17-20  21-24  25-28  29-32  33-36  37-40
     1  02 49  05 41  22 27  94 43  93 64  04 23  07 20  74 11  67 95  40 82
     2  11 96  73 64  69 60  62 78  37 01  09 25  33 02  08 01  38 53  74 82
     3  48 25  68 34  65 49  69 92  40 79  05 40  33 51  54 39  61 30  31 36
     4  27 24  67 30  80 21  48 12  35 36  04 88  18 99  77 49  48 49  30 71
     5  32 53  27 72  65 72  43 07  07 22  86 52  91 84  57 92  65 71  00 11
     6  66 75  79 89  55 92  37 59  34 31  43 20  45 58  25 45  44 36  92 65
     7  11 26  63 45  45 76  50 59  77 46  34 66  82 69  99 26  74 29  75 16
     8  17 87  23 91  42 45  56 18  01 46  93 13  74 89  24 64  25 75  92 84
     9  62 56  13 03  65 03  40 81  47 54  51 79  80 81  33 61  01 09  77 30
    10  62 79  63 07  79 35  49 77  05 01  30 10  50 81  33 00  99 79  19 70
    11  75 51  02 17  71 04  33 93  36 60  42 75  76 22  23 87  56 54  84 68
    12  87 43  90 16  91 63  51 72  65 90  44 43  70 72  17 98  70 63  90 32
    13  97 74  20 26  21 10  74 87  88 03  38 33  76 52  26 92  14 95  90 51
    14  98 81  10 60  01 21  57 10  28 75  21 82  88 39  12 85  18 86  16 24
    15  51 26  40 18  52 64  60 79  25 53  29 00  42 66  95 78  58 36  29 98
    16  40 23  99 33  76 10  41 96  86 10  49 12  00 29  41 80  03 59  93 17
    17  26 93  65 91  86 51  66 72  76 45  46 32  94 46  81 94  19 06  66 47
    18  88 50  21 17  16 98  29 94  09 74  42 39  46 22  00 69  09 48  16 46
    19  63 49  93 80  93 25  59 36  19 95  79 86  78 05  69 01  02 33  83 74
    20  36 37  98 12  06 03  31 77  87 10  73 82  83 10  83 60  50 94  40 91
    21  93 80  12 23  22 47  47 95  70 17  59 33  43 06  47 43  06 12  66 60
    22  29 85  68 71  20 56  31 15  00 53  25 36  58 12  65 22  41 40  24 31
    23  97 72  08 79  31 88  26 51  30 50  71 01  71 51  77 06  95 79  29 19
    24  85 23  70 91  05 74  60 14  63 77  59 93  81 56  47 34  17 79  27 53
    25  75 74  67 52  68 31  72 79  57 73  72 36  48 73  24 36  87 90  68 02

Fifth Thousand

         1-4    5-8   9-12  13-16  17-20  21-24  25-28  29-32  33-36  37-40
     1  29 93  50 69  71 63  17 55  25 79  10 47  88 93  79 61  42 82  13 63
     2  15 11  40 71  26 51  89 07  77 87  75 51  01 31  03 42  94 24  81 11
     3  03 87  04 32  25 10  58 98  76 29  22 03  99 41  24 3&  12 76  50 22
     4  79 39  03 91  88 40  75 64  52 69  65 95  92 06  40 14  28 42  29 60
     5  30 03  50 69  15 79  19 65  44 28  64 81  95 23  14 48  72 18  15 94
     6  29 03  99 98  61 28  75 97  98 02  68 53  13 91  98 38  13 72  43 73
     7  78 19  60 81  08 24  10 74  97 77  09 59  94 35  69 84  82 09  49 56
     8  15 84  78 54  93 91  44 29  13 51  80 13  07 37  52 21  53 91  09 86
     9  36 61  46 22  48 49  19 49  72 09  92 58  79 20  53 41  02 18  00 64
    10  40 54  95 48  84 91  46 54  38 62  35 54  14 44  66 88  89 47  41 80
    11  40 87  80 89  97 14  28 60  99 82  90 30  87 80  07 51  58 71  66 58
    12  10 22  94 92  82 41  17 33  14 68  59 45  51 87  56 08  90 80  66 60
    13  15 91  87 67  87 30  62 42  59 28  44 12  42 50  88 31  13 77  16 14
    14  13 40  31 87  96 49  90 99  44 04  64 97  94 14  62 18  15 59  83 35
    15  66 52  39 45  96 74  90 89  02 71  10 00  99 86  48 17  64 06  89 09
    16  91 66  53 64  69 68  34 31  78 70  25 97  50 46  62 21  27 25  06 20
    17  67 41  58 75  15 08  20 77  37 29  73 20  15 75  93 96  91 76  96 99
    18  76 52  79 69  96 23  72 43  34 48  63 39  23 23  94 60  88 79  06 17
    19  19 81  54 77  89 74  34 81  71 47  10 95  43 43  55 81  19 45  44 07
    20  25 59  25 35  87 76  35 47  25 75  84 34  76 89  18 05  73 95  72 22
    21  55 90  24 55  39 63  64 63  16 09  95 99  98 28  87 40  66 66  66 92
    22  02 47  05 83  76 79  79 42  24 82  42 42  39 61  62 47  49 11  72 64
    23  18 63  05 32  63 13  31 99  76 19  35 85  91 23  50 14  63 28  86 59
    24  89 67  33 82  30 16  06 39  20 07  59 50  33 84  02 76  45 03  33 33
    25  62 98  66 73  64 06  59 51  74 27  84 62  31 45  65 82  86 05  73 00

Sixth Thousand

         1-4    5-8   9-12  13-16  17-20  21-24  25-28  29-32  33-36  37-40
     1  27 50  13 05  46 34  63 85  87 60  35 55  05 67  88 15  47 00  50 92
     2  02 31  57 57  62 98  41 09  66 qi  69 88  92 83  35 70  76 59  02 58
     3  37 43  12 83  66 39  77 33  63 26  53 99  48 65  23 06  94 29  53 04
     4  83 56  65 54  19 33  35 42  92 12  37 14  70 75  18 58  98 57  12 52
     5  06 81  56 27  49 32  12 42  92 42  05 96  82 94  70 25  45 49  18 16
     6  39 15  03 60  15 56  73 16  48 74  50 27  43 42  58 36  73 16  39 90
     7  84 45  71 93  10 27  15 83  84 20  57 42  41 28  42 06  15 90  70 47
     8  82 47  05 77  06 89  47 13  92 85  60 12  32 89  25 22  42 38  87 37
     9  98 04  06 70  24 21  69 02  65 42  55 33  11 95  72 35  73 23  57 26
    10  18 33  49 04  14 33  48 50  15 64  58 26  14 91  46 02  72 13  48 62
    11  33 92  19 93  38 27  43 40  27 72  79 74  86 57  41 83  58 71  56 99
    12  48 66  74 30  44 81  06 80  29 09  50 31  69 61  24 64  28 89  97 79
    13  85 85  07 54  21 50  31 80  10 19  56 65  82 52  26 58  55 12  26 34
    14  08 27  08 08  35 87  96 57  33 12  01 77  52 76  09 89  71 12  17 69
    15  59 61  22 14  26 09  96 75  17 94  51 08  41 91  45 94  80 48  59 92
    16  17 45  77 79  31 66  36 54  92 85  65 60  53 98  63 50  11 20  96 63
    17  11 26  37 08  07 71  95 95  39 75  92 48  99 78  23 33  19 56  06 67
    18  48 08  13 98  16 52  41 15  73 96  32 55  03 12  38 30  88 77  17 03
    19  76 27  72 22  99 61  72 15  00 25  21 54  47 79  18 41  58 50  57 66
    20  98 89  22 25  72 92  53 55  07 98  66 71  53 29  61 71  56 96  41 78
    21  88 69  61 63  01 67  61 88  58 79  35 65  08 45  63 38  69 86  79 47
    22  12 58  13 75  80 98  01 35  91 16  18 36  90 54  99 17  68 36  85 06
    23  08 86  96 36  14 09  43 85  51 20  65 18  06 40  52 17  48 10  68 97
    24  33 81  05 51  32 48  60 12  32 44  08 12  89 00  98 82  79 17  97 22
    25  05 15  99 28  87 15  07 08  66 92  53 81  69 42  02 27  65 33  57 69

Seventh Thousand

         1-4    5-8   9-12  13-16  17-20  21-24  25-28  29-32  33-36  37-40
     1  80 30  23 64  67 96  21 33  36 90  03 91  69 33  90 13  34 48  02 19
     2  61 29  89 61  32 08  12 62  26 08  42 00  31 73  31 30  30 61  34 11
     3  23 33  61 01  02 21  11 81  51 32  36 10  23 74  50 31  90 11  73 52
     4  94 21  32 92  93 50  72 67  23 20  74 59  30 30  48 66  75 32  27 97
     5  87 61  92 69  01 60  28 79  74 76  86 06  39 29  73 85  03 27  50 57
     6  37 56  19 18  03 42  86 03  85 74  44 81  86 45  71 16  13 52  35 56
     7  64 86  66 31  55 04  88 40  10 30  84 38  06 13  58 83  62 04  63 52
     8  22 69  58 45  49 23  09 81  98 84  05 04  75 99  27 70  72 79  32 19
     9  23 22  14 22  64 90  10 26  74 23  53 91  27 73  78 19  92 43  68 10
    10  42 38  59 64  72 96  46 57  89 67  22 81  94 50  69 84  18 31  06 39
    11  17 18  01 34  10 98  37 48  93 86  88 59  69 53  78 86  37 26  85 48
    12  39 45  69 53  94 89  58 97  29 33  29 19  50 94  80 57  31 99  38 91
    13  43 18  11 42  56 19  48 44  45 02  84 29  01 78  65 77  76 84  88 85
    14  59 44  06 45  68 55  16 65  66 13  38 00  95 76  50 67  67 65  18 83
    15  01 50  34 32  38 00  37 57  47 82  66 59  19 50  87 14  35 59  79 47
    16  79 14  60 35  47 95  90 71  31 03  85 37  38 70  34 16  64 55  66 49
    17  01 56  63 68  80 26  14 97  23 88  59 22  82 39  70 83  48 34  46 48
    18  25 76  18 71  29 25  15 51  92 96  01 01  28 18  03 35  11 10  27 84
    19  23 52  10 83  45 06  49 85  35 45  84 08  81 13  52 57  21 23  67 02
    20  91 64  08 64  25 74  16 10  97 31  10 27  24 48  89 06  42 81  29 10
    21  80 86  07 27  26 70  08 65  85 20  31 23  28 99  39 63  32 03  71 91
    22  31 71  37 60  95 60  94 95  54 45  27 97  03 67  30 54  86 04  12 41
    23  05 83  50 36  09 04  39 15  66 55  80 36  39 71  24 10  62 22  21 53
    24  98 70  02 90  30 63  62 59  26 04  97 20  00 91  28 80  40 23  09 91
    25  82 79  35 45  64 53  93 24  86 55  48 72  18 57  05 79  20 09  31 46

Eighth  Thousand 

1-4 

5-8 

Q-12 

13-16 

17-20 

21-24 

25-28 

29-32 

33-36 

37-40 

I 

37  52 

49  55 

40  65 

27  61 

08  59 

91  23 

26  18 

95  04 

98  20 

99  52 

2 

48  16 

69  65 

69  02 

08  83 

08  83 

68  37 

00  96 

13  59 

12  16 

17  93 

3 

50  43 

06  59 

56  53 

30  61 

40  21 

29  06 

49  60 

90  38 

21  43 

19  25 

4 

89  3i 

62  79 

45  73 

71  72 

77  ii 

28  80 

72  35 

75  77 

24  72 

98  43 

5 

63  29 

90  61 

86  39 

07  38 

3885 

77  06 

10  23 

30  84 

07  95 

30  76 

6 

71  68 

93  94 

08  72 

36  27 

85  89 

40  59 

83  37 

93  85 

73  97 

84  05 

7 

05  06 

96  63 

58  24 

05  95 

56  64 

77  53 

85  64 

15  95 

93  9i 

59  03 

8 

03  35 

58  95 

46  44 

25  7o 

31  66 

01  05 

44  44 

62  91 

36  31 

45  04 

9 

13  04 

57  67 

74  77 

53  35 

93  5i 

82  83 

27  38 

63  16 

04  48 

75  23 

JO 

49  96 

43  94 

56  04 

02  79 

55  78 

01  44 

75  26 

85  54 

01  81 

32  82 

ii 

24  36 

24  08 

44  77 

57  07 

54  4i 

04  56 

09  44 

3058 

25  45 

37  56 

12 

55  19 

97  20 

01  1 1 

47  45 

79  79 

06  72 

12  81 

86  97 

54  09 

06  53 

13 

02  28 

54  60 

28  35 

32  94 

36  74 

51  63 

96  90 

04  13 

30  43 

10  14 

14 

90  50 

13  78 

22  20 

37  56 

97  95 

49  95 

9i  15 

52  73 

12  93 

78  94  1 

15 

33  7i 

32  43 

2958 

47  38 

39  96 

67  5i 

6447 

49  9i 

6458 

93  07   | 

16 

705k 

28  49 

54  32 

97  7o 

27  81 

64  69 

7i  52 

02  56 

61  37 

04  58 

17 

09  68 

96  10 

57  78 

85  00 

89  81 

98  30 

19  40 

76  28 

62  99 

99  83 

18 

19  36 

60  85 

35  ?4 

12  87 

83  88 

66  54 

32  00 

30  20 

05  30

42  63

19 

04  75 

44  49 

64  26 

51  46

80  50 

53  91

00  55 

67  36 

68  66 

08  29

20 

79  83 

32  39 

46  77 

56  83

42  21 

60  03 

14  47 

07  01 

66  85 

49  22 

21 

80  99 

42  43 

08  58 

54  41

98  05 

54  39 

34  42 

97  47 

38  35 

59  40 

22 

48  83

64  99 

86  94 

48  78

79  20 

62  23 

56  45 

92  65 

56  36 

83  02

23 

28  45

35  85 

22  20 

13  01 

73  96 

70  05 

84  50 

68  59 

96  58 

16  63 

24 

52  07 

63  15 

82  30 

66  23 

14  26 

66  61 

17  80 

41  97

40  27 

24  80 

25 

39  14 

52  18 

35  87 

48  55 

48  81 

03  11 

26  99 

03  80 

08  86 

50  42 
z

$\phi(z)$

$\int_0^z \phi(z)\,dz$

z

$\phi(z)$

$\int_0^z \phi(z)\,dz$

z

$\phi(z)$

$\int_0^z \phi(z)\,dz$

.00 

.39894 

.00000 

.45 

.36053 

. 17364 

.90 

.26609 

.31594 

.01 

.39892 

.00399 

.46 

.35889 

. 17724 

.91 

.26369 

.31859 

.02 

.39886 

.00798 

.47 

.35723 

. 18082 

.92 

.26129 

.32121 

.03 

.39876 

.01197 

.48 

.35553 

. 18439 

.93 

.25888 

.32381 

.04 

.39862 

.01595 

.49 

.35381 

. 18793 

.94 

.25647 

.32639 

.05 

.39844 

.01994 

.50 

.35207 

. 19146 

.95 

.25406 

.32894 

.06 

.39822 

.02392 

.51 

.35029 

. 19497 

.96 

.25164 

.33147 

.07 

.39797 

.02790 

.52 

.34849 

. 19847 

.97 

.24923 

.33398 

.08 

.39767 

.03188 

.53 

.34667 

.20194 

.98 

.24681 

.33646 

.09 

.39733 

.03586 

.54 

.34482 

.20540 

.99 

.24439 

.33891 

.10 

.39695 

.03983 

.55 

.34294 

.20884 

1.00 

.24197 

.34134 

.11 

.39654 

.04380 

.56 

.34105 

.21226 

1.01 

.23955 

.34375 

.12 

.39608 

.04776 

.57 

.33912 

.21566 

1.02 

.23713 

.34614 

.13 

.39559 

.05172 

.58 

.33718 

.21904 

1.03 

.23471 

.34850 

.14 

.39505 

.05567 

.59 

.33521 

.22240 

1.04 

.23230 

.35083 

.15 

.39448 

.05962 

.60 

.33322 

.22575 

1.05 

.22988 

.35314 

.16 

.39387 

.06356 

.61 

.33121 

.22907 

1.06 

.22747 

.35543 

.17 

.39322 

.06749 

.62 

.32918

.23237 

1.07 

.22506 

.35769 

.18 

.39253 

. 07142 

.63 

.32713 

.23565 

1.08 

.22265 

.35993 

.19 

.39181 

.07535 

.64 

.32506 

.23891 

1.09 

.22025 

.36214 

.20 

.39104 

.07926 

.65 

.32297 

.24215 

1.10 

.21785 

.36433 

.21 

.39024 

.08317 

.66 

.32086 

.24537 

1.11 

.21546 

.36650 

.22 

.38940 

.08706 

.67 

.31874 

.24857 

1.12 

.21307 

.36864 

.23 

.38853 

.09095 

.68 

.31659 

.25175 

1.13 

.21069 

.37076 

.24 

.38762 

.09483 

.69 

.31443 

.25490 

1.14 

.20831 

.37286 

.25 

.38667 

.09871 

.70 

.31225 

.25804 

1.15 

.20594 

.37493 

.26 

.38568 

. 10257 

.71 

.31006 

.26115 

1.16 

.20357 

.37698 

.27 

.38466 

. 10642 

.72 

.30785 

.26424 

1.17 

.20121 

.37900 

.28 

.38361 

. 11026 

.73 

.30563 

.26730 

1.18 

. 19886 

.38100 

.29 

.38251 

. 11409 

.74 

.30339 

.27035 

1.19 

. 19652 

.38298 

.30 

.38139 

. 11791 

.75 

.30114 

.27337 

1.20 

. 19419 

.38493 

.31 

.38023 

. 12172 

.76 

.29887 

.27637 

1.21 

. 19186 

.38686 

.32 

.37903 

. 12552 

.77 

.29659 

.27935 

1.22 

. 18954 

.38877 

.33 

.37780 

. 12930 

.78 

.29431 

.28230 

1.23 

. 18724 

.39065 

.34 

.37654 

. 13307 

.79 

.29200 

.28524 

1.24 

. 18494 

.39251 

.35 

.37524 

. 13683 

.80 

.28969 

. 28814 

1.25 

. 18265 

.39435 

.36 

.37391 

. 14058 

.81 

.28737 

.29103 

1.26 

. 18037 

.39617 

.37 

.37255 

. 14431 

.82 

.28504 

.29389 

1.27 

. 17810 

.39796 

.38 

.37115 

. 14803 

.83 

.28269 

.29673 

1.28 

. 17585 

.39973 

.39 

.36973 

. 15173 

.84 

.28034 

.29955 

1.29 

. 17360 

.40147 

.40 

.36827 

. 15542 

.85 

.27798 

.30234 

1.30 

. 17137 

.40320 

.41 

.36678 

.15910

.86 

.27562 

.30511 

1.31 

. 16915 

.40490 

.42 

.36526 

. 16276 

.87 

.27324 

.30785 

1.32 

. 16694 

.40658 

.43 

.36371 

. 16640 

.88 

.27086 

.31057 

1.33 

. 16474 

.40824 

.44 

.36213 

. 17003 

.89 

.26848 

.31327 

1.34 

. 16256 

.40988 
Table B.2 (cont.)


Ordinates and Areas of the Normal Curve, $\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$

z

$\phi(z)$

$\int_0^z \phi(z)\,dz$

z

$\phi(z)$

$\int_0^z \phi(z)\,dz$

z

$\phi(z)$

$\int_0^z \phi(z)\,dz$

1.35 

. 16038 

.41149 

1.80 

.07895 

.46407 

2.25 

.03174 

.48778 

1.36 

. 15822 

.41309 

1.81 

.07754 

.46485 

2.26 

.03103 

.48809 

1.37 

. 15608 

.41466 

1.82 

.07614 

.46562 

2.27 

.03034 

.48840 

1.38 

. 15395 

.41621 

1.83 

.07477 

.46638 

2.28 

.02965 

.48870 

1.39 

. 15183 

.41774 

1.84 

.07341 

.46712 

2.29 

.02898 

.48899 

1.40 

.14973 

.41924 

1.85 

.07206 

.46784 

2.30 

.02833 

.48928 

1.41 

. 14764 

.42073

1.86 

.07074 

.46856 

2.31 

.02768 

.48956 

1.42 

. 14556 

.42220 

1.87 

.06943 

.46926 

2.32 

.02705 

.48983 

1.43 

. 14350 

.42364 

1.88 

.06814 

.46995 

2.33 

.02643 

.49010 

1.44 

. 14146 

.42507 

1.89 

.06687 

.47062 

2.34 

.02582 

.49036 

1.45 

.13943 

.42647 

1.90 

.06562 

.47128 

2.35 

.02522 

.49061 

1.46 

.13742 

.42786 

1.91 

.06439 

.47193 

2.36 

.02463 

.49086 

1.47 

. 13542 

.42922 

1.92 

.06316 

.47257 

2.37 

.02406 

.49111 

1.48 

. 13344 

.43056 

1.93 

.06195 

.47320 

2.38 

.02349 

.49134 

1.49 

.13147 

.43189 

1.94 

.06077 

.47381 

2.39 

.02294 

.49158 

1.50 

. 12952 

.43319 

1.95 

.05959 

.47441 

2.40 

.02239 

.49180 

1.51 

. 12758 

.43448 

1.96 

.05844 

.47500 

2.41 

.02186 

.49202 

1.52 

.12566 

.43574 

1.97 

.05730 

.47558 

2.42 

.02134 

.49224 

1.53 

. 12376 

.43699 

1.98 

.05618 

.47615 

2.43 

.02083 

.49245 

1.54 

. 12188 

.43822 

1.99 

.05508 

.47670 

2.44 

.02033 

.49266 

1.55 

. 12001 

.43943 

2.00 

.05399 

.47725 

2.45 

.01984 

.49286 

1.56 

.11816 

.44062 

2.01 

.05292 

.47778 

2.46 

.01936 

.49305 

1.57 

. 11632 

.44179 

2.02 

.05186 

.47831 

2.47 

.01889 

.49324 

1.58 

.11450 

.44295 

2.03 

.05082 

.47882 

2.48 

.01842 

.49343 

1.59 

.11270 

.44408 

2.04 

.04980 

.47932 

2.49 

.01797 

.49361 

1.60 

.11092 

.44520 

2.05 

.04879 

.47982 

2.50 

.01753 

.49379 

1.61 

.10915 

.44630 

2.06 

.04780 

.48030 

2.51 

.01709 

.49396 

1.62 

. 10741 

.44738 

2.07 

.04682 

.48077 

2.52 

.01667 

.49413 

1.63 

. 10567 

.44845 

2.08 

.04586 

.48124 

2.53 

.01625 

.49430 

1.64 

. 10396 

.44950 

2.09 

.04491

.48169 

2.54 

. 01585 

.49446 

1.65 

. 10226 

.45053 

2.10 

.04398 

.48214 

2.55 

.01545 

.49461 

1.66 

.10059 

.45154 

2.11 

.04307 

.48257 

2.56 

. 01506 

.49477 

1.67 

.09893 

.45254 

2.12 

.04217 

.48300 

2.57 

.01468 

.49492 

1.68 

.09728 

.45352 

2.13 

.04128 

.48341 

2.58 

.01431 

.49506 

1.69 

.09566 

.45449 

2.14 

.04041 

.48382 

2.59 

.01394 

.49520 

1.70 

.09405 

.45543 

2.15 

.03955 

.48422 

2.60 

.01358 

.49534 

1.71 

.09246 

.45637 

2.16 

.03871 

.48461 

2.61 

.01323 

.49547 

1.72 

.09089 

.45728 

2.17 

.03788 

.48500 

2.62 

.01289 

.49560 

1.73 

.08933 

.45818 

2.18 

.03706 

.48537 

2.63 

.01256 

.49573 

1.74 

.08780 

.45907 

2.19 

.03626 

.48574 

2.64 

.01223 

.49585 

1.75 

.08628 

.45994 

2.20 

.03547 

.48610 

2.65 

.01191 

.49598 

1.76 

.08478 

.46080 

2.21 

.03470 

.48645 

2.66 

.01160 

.49609 

1.77 

.08329 

.46164 

2.22 

.03394 

.48679 

2.67 

.01130 

.49621 

1.78 

.08183 

.46246 

2.23 

.03319 

.48713 

2.68 

.01100 

.49632 

1.79 

.08038 

.46327 

2.24 

.03246 

.48745 

2.69 

.01071 

.49643 
z

$\phi(z)$

$\int_0^z \phi(z)\,dz$

z

$\phi(z)$

$\int_0^z \phi(z)\,dz$

z

$\phi(z)$

$\int_0^z \phi(z)\,dz$

2.70 

.01042 

.49653 

3.15 

.00279 

.49918 

3.60 

.00061 

.49984 

2.71 

.01014 

.49664 

3.16 

.00271 

.49921 

3.61 

.00059 

.49985 

2.72 

.00987 

.49674 

3.17 

.00262 

.49924 

3.62 

.00057 

.49985 

2.73 

.00961 

.49683 

3.18 

.00254 

.49926 

3.63 

.00055 

.49986 

2.74 

.00935 

.49693 

3.19 

.00246 

.49929 

3.64 

.00053 

.49986 

2.75 

.00909 

.49702 

3.20 

.00238 

.49931 

3.65 

.00051 

.49987 

2.76 

.00885 

.49711 

3.21 

.00231 

.49934 

3.66 

.00049 

.49987 

2.77 

.00861 

.49720 

3.22 

.00224 

.49936 

3.67 

.00047 

.49988 

2.78 

.00837 

.49728 

3.23 

. 00216 

.49938 

3.68 

.00046 

.49988 

2.79 

.00814 

.49736 

3.24 

.00210 

.49940 

3.69 

.00044 

.49989 

2.80 

.00792 

.49744 

3.25 

.00203 

.49942 

3.70 

.00042 

.49989 

2.81 

.00770 

.49752 

3.26 

.00196 

.49944 

3.71 

.00041 

.49990 

2.82 

.00748 

.49760 

3.27 

.00190 

.49946 

3.72 

.00039 

.49990 

2.83 

.00727 

.49767 

3.28 

.00184 

.49948 

3.73 

.00038 

.49990 

2.84 

.00707 

.49774 

3.29 

.00178 

.49950 

3.74 

.00037 

.49991 

2.85 

.00687 

.49781 

3.30 

.00172 

.49952 

3.75 

.00035 

.49991 

2.86 

.00668 

.49788 

3.31 

.00167 

.49953 

3.76 

.00034 

.49992 

2.87 

.00649 

.49795 

3.32 

.00161 

.49955 

3.77 

.00033 

.49992 

2.88 

.00631 

.49801 

3.33 

.00156 

.49957 

3.78 

.00031 

.49992 

2.89 

.00613 

.49807 

3.34 

.00151 

.49958 

3.79 

.00030 

.49992 

2.90 

.00595 

.49813 

3.35 

.00146 

.49960 

3.80 

.00029 

.49993 

2.91 

.00578 

.49819 

3.36 

.00141 

.49961 

3.81 

.00028 

.49993 

2.92 

.00562 

.49825 

3.37 

.00136 

.49962 

3.82 

.00027 

.49993 

2.93 

.00545 

.49831 

3.38 

.00132 

.49964 

3.83 

.00026 

.49994 

2.94 

.00530 

.49836 

3.39 

.00127 

.49965 

3.84 

.00025 

.49994 

2.95 

.00514 

.49841 

3.40 

.00123 

.49966 

3.85 

.00024 

.49994 

2.96 

.00499 

.49846 

3.41 

.00119 

.49968 

3.86 

.00023 

.49994 

2.97 

.00485 

.49851 

3.42 

.00115 

.49969 

3.87 

.00022 

.49995 

2.98 

.00471 

.49856 

3.43 

.00111 

.49970 

3.88 

.00021 

.49995 

2.99 

.00457 

.49861 

3.44 

.00107 

.49971 

3.89 

.00021 

.49995 

3.00 

.00443 

.49865 

3.45 

.00104 

.49972 

3.90 

.00020 

.49995 

3.01 

.00430 

.49869 

3.46 

.00100 

.49973 

3.91 

.00019 

.49995 

3.02 

.00417 

.49874 

3.47 

.00097 

.49974 

3.92 

.00018 

.49996 

3.03 

.00405 

.49878 

3.48 

.00094 

.49975 

3.93 

.00018 

.49996 

3.04 

.00393 

.49882 

3.49 

.00090 

.49976 

3.94 

.00017 

.49996 

3.05 

.00381 

.49886 

3.50 

.00087 

.49977 

3.95 

.00016 

.49996 

3.06 

.00370 

.49889 

3.51 

.00084 

.49978 

3.96 

.00016 

.49996 

3.07 

.00358 

.49893 

3.52 

.00081 

.49978 

3.97 

.00015 

.49996 

3.08 

.00348 

.49897 

3.53 

.00079 

.49979 

3.98 

.00014 

.49997 

3.09 

.00337 

.49900 

3.54 

.00076 

.49980 

3.99 

.00014 

.49997 

3.10 

.00327 

.49903 

3.55 

.00073 

.49981 

3.11 

.00317 

.49906 

3.56 

.00071 

.49981 

3.12 

.00307 

.49910 

3.57 

.00068 

.49982 

3.13 

.00298 

.49913 

3.58 

.00066 

.49983 

3.14 

.00288 

.49916 

3.59 

.00063 

.49983 
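
The entries of Table B.2 can be spot-checked numerically. The following is a minimal sketch, not part of the original table, which assumes that the tabulated area is the integral of the normal ordinate from 0 to z. The function names `phi` and `area_0_to_z` are illustrative only; the area is obtained from the error function in the Python standard library.

```python
import math


def phi(z):
    """Ordinate of the standard normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def area_0_to_z(z):
    """Area under the standard normal curve between 0 and z."""
    return 0.5 * math.erf(z / math.sqrt(2.0))


# Print a few rows in the same layout as the table: z, ordinate, area.
for z in (0.00, 1.00, 1.96, 3.00):
    print(f"{z:4.2f}  {phi(z):.5f}  {area_0_to_z(z):.5f}")
```

The printed rows can be compared with the tabulated entries for the same values of z.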
Table B.3
Values of $\chi^2$ Corresponding to Given Probabilities*


Degrees of freedom n

Probability of a deviation greater than $\chi^2$

.01 

.02 

.05 

.10 

.20 

.30 

.50 

1 

6.635 

5.412 

3.841 

2.706 

1.642

1.074 

.455 

2 

9.210 

7.824 

5.991 

4.605 

3.219 

2.408 

1.386 

3 

11.341 

9.837 

7.815 

6.251 

4.642 

3.665 

2.366 

4 

13.277 

11.668 

9.488 

7.779 

5.989 

4.878 

3.357 

5

15.086 

13.388 

11.070 

9.236 

7.289

6.064 

4.351 

6 

16.812 

15.033 

12.592 

10.645 

8.558 

7.231 

5.348 

7 

18.475 

16.622 

14.067 

12.017 

9.803 

8.383 

6.346 

8 

20.090 

18.168 

15.507 

13.362 

11.030 

9.524 

7.344 

9 

21.666 

19.679 

16.919 

14.684 

12.242 

10.656 

8.343 

10 

23.209 

21.161 

18.307 

15.987 

13.442 

11.781 

9.342 

11 

24.725 

22.618 

19.675 

17.275 

14.631 

12.899 

10.341 

12 

26.217 

24.054 

21.026 

18.549 

15.812 

14.011 

11.340 

13 

27.688 

25.472 

22.362 

19.812 

16.985 

15.119 

12.340 

14 

29.141 

26.873 

23.685 

21.064 

18.151 

16.222 

13.339 

15 

30.578 

28.259 

24.996 

22.307 

19.311 

17.322 

14.339 

16 

32.000 

29.633 

26.296 

23.542 

20.465 

18.418 

15.338 

17 

33.409 

30.995 

27.587 

24.769 

21.615 

19.511 

16.338 

18 

34.805 

32.346 

28.869 

25.989 

22.760 

20.601 

17.338 

19 

36.191 

33.687 

30.144 

27.204 

23.900 

21.689 

18.338 

20 

37.566 

35.020 

31.410 

28.412 

25.038 

22.775 

19.337 

21 

38.932 

36.343 

32.671 

29.615 

26.171 

23.858 

20.337 

22 

40.289 

37.659 

33.924 

30.813 

27.301 

24.939 

21.337 

23 

41.638 

38.968 

35.172 

32.007 

28.429 

26.018 

22.337 

24 

42.980 

40.270 

36.415 

33.196 

29.553 

27.096 

23.337 

25 

44.314 

41.566 

37.652 

34.382 

30.675 

28.172 

24.337 

26 

45.642 

42.856 

38.885 

35.563 

31.795 

29.246 

25.336 

27 

46.963 

44.140 

40.113 

36.741 

32.912 

30.319 

26.336 

28 

48.278 

45.419 

41.337 

37.916 

34.027 

31.391 

27.336 

29 

49.588 

46.693 

42.557 

39.087 

35.139 

32.461 

28.336 

30 

50.892 

47.962 

43.773 

40.256 

36.250 

33.530 

29.336 

For larger values of n, the quantity $\sqrt{2\chi^2} - \sqrt{2n - 1}$ may be used as a normal
deviate with unit standard deviation.

*This  table  is  reproduced  from  "Statistical  Methods  for  Research  Workers,"  with  the
generous  permission  of  the  author,  Sir  Ronald  A.  Fisher,  and  the  publishers,  Messrs.  Oliver 
and  Boyd. 
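
The large-n rule quoted under the table can be illustrated with a short computation. This is a sketch under stated assumptions: the helper names are hypothetical, and the n = 30 entry for probability .05 is used only to see how closely the approximation agrees with the last tabulated row.

```python
import math


def fisher_normal_deviate(chi2, n):
    """Approximate normal deviate sqrt(2*chi2) - sqrt(2n - 1)."""
    return math.sqrt(2.0 * chi2) - math.sqrt(2.0 * n - 1.0)


def upper_tail_normal(x):
    """Probability that a standard normal deviate exceeds x."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


# Tabulated chi-square for n = 30 at probability .05 is 43.773.
deviate = fisher_normal_deviate(43.773, 30)
print(round(deviate, 3), round(upper_tail_normal(deviate), 4))
```

The approximate probability printed for n = 30 should lie close to, but not exactly at, the tabulated .05; the agreement improves as n increases.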
Table B.3 (cont.)
Values of $\chi^2$ Corresponding to Given Probabilities*


Degrees of freedom n

Probability of a deviation greater than $\chi^2$

.70 

.80 

.90 

.95 

.98 

.99 

1 

.148 

.0642 

.0158 

.00393 

.000628 

.000157 

2 

.713 

.446 

.211 

.103 

.0404 

.0201 

3 

1.424 

1.005 

.584 

.352 

.185 

.115 

4 

2.195 

1.649 

1.064 

.711 

.429 

.297 

5 

3.000 

2.343 

1.610 

1.145 

.752 

.554 

6 

3.828 

3.070 

2.204 

1.635 

1.134 

.872 

7 

4.671 

3.822 

2.833 

2.167 

1.564 

1.239 

8 

5.527 

4.594 

3.490 

2.733 

2.032 

1.646 

9 

6.393 

5.380 

4.168 

3.325 

2.532 

2.088 

10 

7.267 

6.179 

4.865 

3.940 

3.059 

2.558 

11 

8.148 

6.989 

5.578 

4.575 

3.609 

3.053 

12 

9.034 

7.807 

6.304 

5.226 

4.178 

3.571 

13 

9.926 

8.634 

7.042 

5.892 

4.765 

4.107 

14 

10.821 

9.467

7.790 

6.571 

5.368 

4.660 

15 

11.721 

10.307 

8.547 

7.261 

5.985 

5.229 

16 

12.624 

11.152 

9.312 

7.962 

6.614 

5.812 

17 

13.531 

12.002 

10.085 

8.672 

7.255 

6.408 

18 

14.440 

12.857 

10.865 

9.390 

7.906 

7.015 

19 

15.352 

13.716 

11.651 

10.117 

8.567 

7.633 

20 

16.266 

14.578 

12.443 

10.851 

9.237 

8.260 

21 

17.182 

15.445 

13.240 

11.591 

9.915 

8.897 

22 

18.101 

16.314 

14.041 

12.338 

10.600 

9.542 

23 

19.021 

17.187 

14.848 

13.091 

11.293 

10.196 

24 

19.943 

18.062 

15.659 

13.848 

11.992 

10.856 

25 

20.867 

18.940 

16.473 

14.611 

12.697 

11.524 

26 

21.792 

19.820 

17.292 

15.379 

13.409 

12.198 

27 

22.719 

20.703 

18.114 

16.151 

14.125 

12.879 

28 

23.647 

21.588 

18.939 

16.928 

14.847 

13.565 

29 

24.577 

22.475 

19.768 

17.708 

15.574 

14.256 

30 

25.508 

23.364 

20.599 

18.493 

16.306 

14.953 

For larger values of n, the quantity $\sqrt{2\chi^2} - \sqrt{2n - 1}$ may be used as a normal deviate
with unit standard deviation.
Table B.4
Values of t Corresponding to Given Probabilities*


Degrees of freedom n

Probability of a deviation greater than t

.005 

.01 

.025

.05 

.1 

.15 

1 

63.657 

31.821 

12.706 

6.314 

3.078 

1.963 

2 

9.925 

6.965 

4.303 

2.920 

1.886

1.386 

3 

5.841 

4.541 

3.182 

2.353 

1.638 

1.250 

4 

4.604 

3.747 

2.776 

2.132 

1.533 

1.190 

5

4.032 

3.365 

2.571 

2.015 

1.476 

1.156 

6 

3.707 

3.143 

2.447 

1.943 

1.440 

1.134 

7 

3.499 

2.998 

2.365 

1.895 

1.415 

1.119 

8 

3.355 

2.896 

2.306 

1.860 

1.397 

1.108 

9 

3.250 

2.821 

2.262 

1.833 

1.383 

1.100 

10 

3.169 

2.764 

2.228 

1.812 

1.372

1.093 

11 

3.106 

2.718 

2.201 

1.796 

1.363 

1.088 

12 

3.055 

2.681 

2.179 

1.782 

1.356 

1.083 

13 

3.012 

2.650 

2.160 

1.771 

1.350 

1.079 

14 

2.977 

2.624 

2.145 

1.761 

1.345 

1.076 

15 

2.947 

2.602 

2.131 

1.753 

1.341 

1.074 

16 

2.921 

2.583 

2.120 

1.746 

1.337 

1.071 

17 

2.898 

2.567 

2.110 

1.740 

1.333 

1.069 

18 

2.878 

2.552 

2.101

1.734 

1.330 

1.067 

19 

2.861 

2.539 

2.093 

1.729 

1.328

1.066 

20 

2.845 

2.528 

2.086 

1.725 

1.325 

1.064 

21 

2.831 

2.518 

2.080 

1.721 

1.323 

1.063 

22 

2.819 

2.508 

2.074 

1.717 

1.321 

1.061 

23 

2.807 

2.500 

2.069 

1.714 

1.319 

1.060 

24 

2.797 

2.492 

2.064 

1.711 

1.318 

1.059 

25 

2.787 

2.485 

2.060 

1.708 

1.316 

1.058 

26 

2.779 

2.479 

2.056 

1.706 

1.315 

1.058 

27 

2.771 

2.473 

2.052 

1.703 

1.314 

1.057 

28 

2.763 

2.467 

2.048 

1.701 

1.313 

1.056 

29 

2.756 

2.462 

2.045 

1.699 

1.311 

1.055 

30 

2.750 

2.457 

2.042 

1.697 

1.310 

1.055 

∞

2.576 

2.326 

1.960 

1.645 

1.282 

1.036 

The probability of a deviation numerically greater than t is twice the probability
given at the head of the table.


*This  table  is  reproduced  from  "Statistical  Methods  for  Research  Workers,"  with  the
generous  permission  of  the  author,  Sir  Ronald  A.  Fisher,  and  the  publishers,  Messrs.  Oliver 
and  Boyd. 
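
The relation between the one-tailed probabilities at the head of Table B.4 and the two-tailed probability can be checked numerically. The sketch below is illustrative only: it assumes the standard density of Student's t with n degrees of freedom and integrates it by Simpson's rule. The function names and the number of integration steps are arbitrary choices, not taken from the text.

```python
import math


def t_density(t, n):
    """Density of Student's t with n degrees of freedom."""
    c = math.gamma((n + 1) / 2.0) / (math.sqrt(n * math.pi) * math.gamma(n / 2.0))
    return c * (1.0 + t * t / n) ** (-(n + 1) / 2.0)


def upper_tail(t, n, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the interval [0, t]."""
    h = t / steps
    total = t_density(0.0, n) + t_density(t, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, n)
    return 0.5 - total * h / 3.0


def two_tailed(t, n):
    """Probability of a deviation numerically greater than t."""
    return 2.0 * upper_tail(t, n)


# Tabulated t for n = 10 under the heading .025 is 2.228.
print(round(upper_tail(2.228, 10), 4), round(two_tailed(2.228, 10), 4))
```

For n = 10 and t = 2.228 the one-tailed value should agree with the column heading .025 to about three decimals, and the two-tailed value is twice that.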
Table B.4 (cont.)
Values of t Corresponding to Given Probabilities*


Degrees of freedom n

Probability of a deviation greater than t

.2 

.25 

.3 

.35 

.4 

.45 

1 

1.376 

1.000 

.727 

.510 

.325 

.158 

2 

1.061 

.816 

.617 

.445 

.289 

.142 

3 

.978 

.765 

.584 

.424 

.277 

.137 

4 

.941 

.741 

.569 

.414 

.271 

.134 

5 

.920 

.727 

.559 

.408 

.267 

.132 

6 

.906 

.718 

.553 

.404 

.265 

.131 

7

.896 

.711 

.549 

.402 

.263 

.130 

8 

.889 

.706 

.546 

.399 

.262 

.130 

9 

.883 

.703 

.543 

.398 

.261 

.129 

10 

.879 

.700 

.542 

.397

.260

.129 

11 

.876 

.697 

.540 

.396 

.260 

.129 

12 

.873 

.695 

.539 

.395 

.259 

.128 

13 

.870 

.694 

.538 

.394 

.259 

.128 

14 

.868 

.692 

.537 

.393 

.258 

.128 

15 

.866 

.691 

.536 

.393 

.258 

.128 

16 

.865 

.690 

.535 

.392 

.258 

.128 

17 

.863 

.689 

.534 

.392 

.257 

.128 

18 

.862 

.688 

.534 

.392 

.257

.127 

19 

.861 

.688 

.533 

.391 

.257 

.127 

20 

.860 

.687

.533 

.391 

.257 

.127 

21 

.859 

.686 

.532 

.391 

.257 

.127 

22 

.858 

.686 

.532 

.390 

.256 

.127 

23 

.858 

.685 

.532 

.390 

.256 

.127 

24 

.857 

.685 

.531 

.390 

.256 

.127 

25 

.856 

.684 

.531 

.390 

.256 

.127 

26 

.856 

.684 

.531 

.390 

.256 

.127 

27 

.855 

.684 

.531 

.389 

.256 

.127 

28 

.855 

.683 

.530 

.389 

.256 

.127 

29 

.854 

.683 

.530 

.389 

.256 

.127 

30 

.854 

.683 

.530 

.389 

.256 

.127 

∞

.842 

.674 

.524 

.385 

.253 

.126 

The  probability  of  a  deviation  numerically  greater  than  t  is  twice  the  probability  given 
at  the  head  of  the  table. 
Table B.6
Critical Values for the Kolmogorov Test
Values of $D_N$ such that $P[\max|S_N(x) - F(x)| > D_N] = \alpha$


  N     α = 0.20      α = 0.10      α = 0.05      α = 0.01

  5      0.446         0.510         0.565         0.669
  6      0.410         0.470         0.521         0.618
  7      0.381         0.438         0.486         0.577
  8      0.358         0.411         0.457         0.543
  9      0.339         0.388         0.432         0.514
 10      0.322         0.368         0.410         0.490
 11      0.307         0.352         0.391         0.468
 12      0.295         0.338         0.375         0.450
 13      0.284         0.325         0.361         0.433
 14      0.274         0.314         0.349         0.418
 15      0.266         0.304         0.338         0.404
 16      0.258         0.295         0.328         0.392
 17      0.250         0.286         0.318         0.381
 18      0.244         0.278         0.309         0.371
 19      0.237         0.272         0.301         0.363
 20      0.231         0.264         0.294         0.356
 25      0.21          0.24          0.27          0.32
 30      0.19          0.22          0.24          0.29
 35      0.18          0.21          0.23          0.27
>35      $1.07N^{-1/2}$  $1.22N^{-1/2}$  $1.36N^{-1/2}$  $1.63N^{-1/2}$

Adapted from Massey, F. J., Jr., "The Kolmogorov-Smirnov Test for Goodness of
Fit," J. Amer. Stat. Assoc., 46, 1951, p. 70, with the kind permission of the author and
the publisher.
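
The last row of Table B.6 and the statistic it refers to can be sketched in a few lines. This is an illustrative sketch only: it assumes that $S_N(x)$ is the sample cumulative step function and $F(x)$ a fully specified continuous distribution function, and the names `critical_value_large_n` and `kolmogorov_statistic` are hypothetical.

```python
import math

# Coefficients from the ">35" row of the table, keyed by significance level.
LARGE_N_COEFF = {0.20: 1.07, 0.10: 1.22, 0.05: 1.36, 0.01: 1.63}


def critical_value_large_n(n, alpha):
    """Approximate critical value of D_N for N > 35."""
    return LARGE_N_COEFF[alpha] / math.sqrt(n)


def kolmogorov_statistic(sample, cdf):
    """max |S_N(x) - F(x)|, checked just before and at each sample point."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Five observations tested against the uniform distribution on (0, 1).
d = kolmogorov_statistic([0.05, 0.25, 0.45, 0.65, 0.85], lambda x: x)
print(round(d, 3), round(critical_value_large_n(100, 0.05), 3))
```

For five observations the computed statistic would be compared with the N = 5 row of the table; the large-N formula applies only beyond N = 35.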
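Table B.8

Significance Test for the Median (the Walsh Test) at Stated Significance Levels*


N    α       Either                                        Or

5    0.125   $\tfrac{1}{2}(d_4 + d_5) < 0$                  $\tfrac{1}{2}(d_1 + d_2) > 0$
     0.062   $d_5 < 0$                                      $d_1 > 0$

6    0.094   $\max[d_5, \tfrac{1}{2}(d_4 + d_6)] < 0$       $\min[d_2, \tfrac{1}{2}(d_1 + d_3)] > 0$
     0.062   $\tfrac{1}{2}(d_5 + d_6) < 0$                  $\tfrac{1}{2}(d_1 + d_2) > 0$
     0.031   $d_6 < 0$                                      $d_1 > 0$

7    0.109   $\max[d_5, \tfrac{1}{2}(d_4 + d_7)] < 0$       $\min[d_3, \tfrac{1}{2}(d_1 + d_4)] > 0$
     0.047   $\max[d_6, \tfrac{1}{2}(d_5 + d_7)] < 0$       $\min[d_2, \tfrac{1}{2}(d_1 + d_3)] > 0$
     0.031   $\tfrac{1}{2}(d_6 + d_7) < 0$                  $\tfrac{1}{2}(d_1 + d_2) > 0$
     0.016   $d_7 < 0$                                      $d_1 > 0$

8    0.086   $\max[d_6, \tfrac{1}{2}(d_4 + d_8)] < 0$       $\min[d_3, \tfrac{1}{2}(d_1 + d_5)] > 0$
     0.055   $\max[d_6, \tfrac{1}{2}(d_5 + d_8)] < 0$       $\min[d_3, \tfrac{1}{2}(d_1 + d_4)] > 0$
     0.023   $\max[d_7, \tfrac{1}{2}(d_6 + d_8)] < 0$       $\min[d_2, \tfrac{1}{2}(d_1 + d_3)] > 0$
     0.016   $\tfrac{1}{2}(d_7 + d_8) < 0$                  $\tfrac{1}{2}(d_1 + d_2) > 0$
     0.008   $d_8 < 0$                                      $d_1 > 0$

9    0.102   $\max[d_6, \tfrac{1}{2}(d_4 + d_9)] < 0$       $\min[d_4, \tfrac{1}{2}(d_1 + d_6)] > 0$
     0.043   $\max[d_7, \tfrac{1}{2}(d_5 + d_9)] < 0$       $\min[d_3, \tfrac{1}{2}(d_1 + d_5)] > 0$
     0.020   $\max[d_8, \tfrac{1}{2}(d_5 + d_9)] < 0$       $\min[d_2, \tfrac{1}{2}(d_1 + d_5)] > 0$
     0.012   $\max[d_8, \tfrac{1}{2}(d_7 + d_9)] < 0$       $\min[d_2, \tfrac{1}{2}(d_1 + d_3)] > 0$
     0.008   $\tfrac{1}{2}(d_8 + d_9) < 0$                  $\tfrac{1}{2}(d_1 + d_2) > 0$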

For  continuation  of  Table  B.8  see  next  page


* Adapted from Walsh, John E., "Applications of Some Significance Tests for the Median
Which Are Valid Under Very General Conditions," J. Amer. Stat. Assoc., 44, 1949, p. 343,
with the kind permission of the author and the publisher.
Table  B.8  (continued)


N 

a 

Either 

Or 

10 

0.111 
0.051 
0.021 
0.010 

max[A,  i(A  +  dio)]  <  0 
max[A,  £(A  +  dio)]  <  0 
max[A,  HA  +  dio)]  <  0 
max[A,  lida  +  dio)]  <  0 

min[A,  HA  +  A)]  >  0 
min[A,  HA  +  A)]  >  0 
min[A,  |(A  +  A)]  >  0 
min[A,  HA  +  A)]  >  0 

11 

0.097 
0.056 
0.021 
0.011 

max[A,  HA  +  dn)]  <  0 
max[A,  HA  +  dn)]  <  0 
max[HA  +  dn),  HA  +  A)]  <  0 
max[A,HA  +  Ai)<0 

min[A,  HA  +  A)]  >  0 
min[A,  HA  +  di)]  >  0 
min[HA  +  A),  HA  +  A)]  >  0 
min[A,  HA  +  A)]  >  0 

12 

0.094 
0.048 
0.020 
0.011 

max[HA  +  A2),  HA  +  dn)]  <  0 
max[A,  HA  +  A  2)]  <  0 
max[A,  HA  +  A2)]  <  0 
max[HA  +  A2),  HA  +  Ao)]  <  0 

min[HA  +  A),  HA  +  A)]  >  0 
min[A,  J(A  +  A)]  >  0 
min[A,  HA  +  A)]  >  0 
min[i(A  +  A),  HA  +  A)]  >  0 

13 

0.094 
0.047 
0.020 
0.010 

max[HA  +  A3),  HA  +  dn)]  <  0 
max[HA  +  A3),  HA  +  A  2)]  <  0 
max[HA  +  A3),  HA  +  Ao)]  <  0 
max[Ao,  HA  +  A3)]  <  0 

min[i(A  +  Ao),  HA  +  A)]  >  0 
min[HA  +  A),  HA  +  A)]  >  0 
min[HA  +  fife),  HA  +  A)]  >  0 
min[A,  HA  +  A)]  >  0 

14 

0.094 
0.047 
0.020 
0.010 

max[HA  +  A4),  HA  +  dis)]  <  0 
max[HA  +  A  4),  HA  +  As)]  <  0 
max[Ao,  HA  +  di4)]  <  0 
maxfHA  +  A  4),  HAo  +  Ai)]  <  0 

min[HA  +  Ai),  HA  +  dio)]  >  0 
min[HA  +  dio),  HA  +  A)]  >  0 
min[A,  £(A  +  A)]  >  0 
min[HA  +  A),  HA  +  A)]  >  0 

15 

0.094 
0.047 
0.020 
0.010 

max[HA  +  As),  ^(^5  +  di4)]  <  0 
maxfHA  +  </i5),  HA  +  A4)]  <  0 
max[HA  +  As),  HAo  +  dn)]  <  0 
max[Ai,  HA  +  As)]  <  0 

min[HA  +  A 2),  i(A  +  Ai)]  >  0 
min[KA  +  Ai),  HA  +  Ao)]  >  0 
min[KA  +  Ao),  i(A  +  A)]  >  0 
min[A,  HA  +  A)]  >  0 

Reject $H_0[\mu = 0]$ against two-tailed alternative $H_1[\mu \ne 0]$ at significance level $\alpha$
if either of the corresponding alternatives is true. For one-tailed test against $H_1[\mu < 0]$
use column 3 with significance level $\alpha/2$. For one-tailed test against $H_1[\mu > 0]$ use
column 4 with significance level $\alpha/2$.
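
The rejection rule stated above can be sketched for the smallest tabulated case. This is a hedged illustration, not the author's procedure: it assumes that $d_1 \le d_2 \le \dots \le d_N$ are the observations arranged in increasing order, and it reads the N = 5 entries at $\alpha = 0.125$ as $\tfrac{1}{2}(d_4 + d_5) < 0$ (column 3) and $\tfrac{1}{2}(d_1 + d_2) > 0$ (column 4). The function name is hypothetical.

```python
def walsh_test_n5(values):
    """Walsh test for N = 5 at two-tailed level 0.125 (one-tailed 0.062).

    Returns (column3, column4): column3 is the "Either" condition, used for
    the one-tailed alternative mu < 0; column4 is the "Or" condition, used
    for the one-tailed alternative mu > 0.
    """
    if len(values) != 5:
        raise ValueError("this sketch covers only N = 5")
    d = sorted(values)  # d[0] is d_1, ..., d[4] is d_5 (assumed ordering)
    column3 = 0.5 * (d[3] + d[4]) < 0
    column4 = 0.5 * (d[0] + d[1]) > 0
    return column3, column4


column3, column4 = walsh_test_n5([0.4, 1.1, 0.7, 2.3, 0.2])
reject_two_tailed = column3 or column4  # two-tailed test at alpha = 0.125
print(column3, column4, reject_two_tailed)
```

Under these assumptions the two-tailed test rejects when either condition holds, and each one-tailed test uses a single column at half the significance level.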
Table B.9
Critical Values of U for the Mann-Whitney Test
(a) $\alpha = 0.01$ (one-tailed test)


n1 \ n2    9   10   11   12   13   14   15   16   17   18   19   20

   3       1    1    1    2    2    2    3    3    4    4    4    5
   4       3    3    4    5    5    6    7    7    8    9    9   10
   5       5    6    7    8    9   10   11   12   13   14   15   16
   6       7    8    9   11   12   13   15   16   18   19   20   22
   7       9   11   12   14   16   17   19   21   23   24   26   28
   8      11   13   15   17   20   22   24   26   28   30   32   34
   9      14   16   18   21   23   26   28   31   33   36   38   40
  10      16   19   22   24   27   30   33   36   38   41   44   47
  11      18   22   25   28   31   34   37   41   44   47   50   53
  12      21   24   28   31   35   38   42   46   49   53   56   60
  13      23   27   31   35   39   43   47   51   55   59   63   67
  14      26   30   34   38   43   47   51   56   60   65   69   73
  15      28   33   37   42   47   51   56   61   66   70   75   80

(b) $\alpha = 0.05$ (one-tailed test)


n1 \ n2    9   10   11   12   13   14   15   16   17   18   19   20

   3       3    4    5    5    6    7    7    8    9    9   10   11
   4       6    7    8    9   10   11   12   14   15   16   17   18
   5       9   11   12   13   15   16   18   19   20   22   23   25
   6      12   14   16   17   19   21   23   25   26   28   30   32
   7      15   17   19   21   24   26   28   30   33   35   37   39
   8      18   20   23   26   28   31   33   36   39   41   44   47
   9      21   24   27   30   33   36   39   42   45   48   51   54
  10      24   27   31   34   37   41   44   48   51   55   58   62
  11      27   31   34   38   42   46   50   54   57   61   65   69
  12      30   34   38   42   47   51   55   60   64   68   72   77
  13      33   37   42   47   51   56   61   65   70   75   80   84
  14      36   41   46   51   56   61   66   71   77   82   87   92
  15      39   44   50   55   61   66   72   77   83   88   94  100

Abridged  from  Auble,  D.,  "Extended  Tables  for  the  Mann-Whitney  Statistic," 
Bulletin  of  the  Institute  of  Educational  Research,  Indiana  University,  1,  no.  2,  1953, 
with  the  kind  permission  of  the  author  and  the  publisher. 
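
The statistic to which Table B.9 refers can be computed by direct counting. The sketch below is illustrative and rests on assumptions not stated on this page: U is taken as the number of pairs in which an observation of the first sample exceeds one of the second (ties counted as one half), and the complementary count is $n_1 n_2 - U$. Which of the two counts is compared with the tabulated value, and in which direction, depends on the convention adopted in the text; the function name is hypothetical.

```python
def mann_whitney_counts(x, y):
    """Return (u_x, u_y): pair counts with x above y, and with y above x."""
    u_x = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u_x += 1.0
            elif xi == yj:
                u_x += 0.5  # a tie contributes one half to each count
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Two small illustrative samples (made-up numbers).
u_x, u_y = mann_whitney_counts([1.2, 2.4, 3.1], [4.0, 5.5, 6.1, 7.3])
print(u_x, u_y)
```

The two counts always sum to $n_1 n_2$, so either one determines the other.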


Table B.10. Jonckheere's k-Sample Test
Prob. $(S \ge S_0)$ for k samples, each of size r


        k = 3                            k = 3

 S0     r = 2     r = 4           S0     r = 3     r = 5

  4     0.2889    0.4156           9     0.1940    0.3396
  6     0.1667    0.3609          11     0.1387    0.3025
  8     0.0889    0.3090          13     0.0946    0.2672
 10     0.0333    0.2602          15     0.0613    0.2340
 12     0.0111    0.2157          17     0.0369    0.2032
 14               0.1756          19     0.0208    0.1748
 16               0.1404          21     0.0107    0.1489
 18               0.1099          23     0.0048    0.1256
 20               0.0844          25     0.0018    0.1049
 22               0.0632          27     0.0006    0.0867
 24               0.0463          29               0.0708
 26               0.0330          31               0.0572
 28               0.0229          33               0.0456
 30               0.0153          35               0.0359
 32               0.0099          39               0.0214
 40               0.0011          43               0.0120
                                  47               0.0063


                  k = 4                      k = 5             k = 6

 S0     r = 2     r = 3     r = 4     r = 2     r = 3     r = 2

 10     0.1302    0.2659    0.3400    0.2110    0.3273    0.2699
 12     0.0829    0.2220    0.3069    0.1625    0.2921    0.2265
 14     0.0484    0.1823    0.2754    0.1213    0.2588    0.1871
 16     0.0262    0.1472    0.2454    0.0878    0.2274    0.1521
 18     0.0123    0.1166    0.2172    0.0613    0.1982    0.1215
 20     0.0052    0.0907    0.1910    0.0412    0.1713    0.0953
 22     0.0016    0.0691    0.1666    0.0265    0.1468    0.0734
 24     0.0004    0.0515    0.1443    0.0162    0.1247    0.0553
 26               0.0374    0.1241    0.0094    0.1049    0.0408
 28               0.0266    0.1058    0.0051    0.0874    0.0294
 30               0.0183    0.0895    0.0026    0.0721    0.0207
 32               0.0123    0.0751    0.0012    0.0588    0.0142
 34               0.0080    0.0624    0.0005    0.0475    0.0094
 36               0.0050    0.0514              0.0379    0.0061
 38               0.0030    0.0420              0.0299    0.0038
 40               0.0017    0.0339              0.0234    0.0023
 42                         0.0272              0.0180
 44                         0.0215              0.0137
 46                         0.0168              0.0102
 48                         0.0130              0.0075
 50                         0.0100              0.0055

Abridged from the tables in Jonckheere, A. R., "A Distribution-Free k-Sample
Test Against Ordered Alternatives," Biometrika, 41, 1954, 133-145, by kind permission
of the author and the publishers.
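
The smallest case in Table B.10 (k = 3, r = 2) is small enough to verify by complete enumeration. The sketch below is an illustration under an assumed definition: S is taken as the number of pairs, drawn from two different samples, that are in the predicted order minus the number in the reverse order, and all allocations of the ranks to the ordered samples are treated as equally likely under the null hypothesis. The function names are hypothetical.

```python
from itertools import combinations


def jonckheere_s(samples):
    """S = (pairs in predicted order) - (pairs in reverse order)."""
    s = 0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            for x in samples[i]:
                for y in samples[j]:
                    if x < y:
                        s += 1
                    elif x > y:
                        s -= 1
    return s


def allocations(remaining, groups_left, r):
    """Yield every way of dealing the remaining ranks into ordered groups of size r."""
    if groups_left == 0:
        yield []
        return
    for group in combinations(remaining, r):
        rest = [v for v in remaining if v not in group]
        for tail in allocations(rest, groups_left - 1, r):
            yield [group] + tail


def null_tail_probability(s0, k=3, r=2):
    """Exact Prob(S >= s0) when every allocation is equally likely."""
    total = hits = 0
    for alloc in allocations(list(range(k * r)), k, r):
        total += 1
        if jonckheere_s(alloc) >= s0:
            hits += 1
    return hits / total


for s0 in (4, 6, 8, 10, 12):
    print(s0, round(null_tail_probability(s0), 4))
```

Enumeration of this kind is practical only for very small k and r; the number of allocations grows rapidly, which is why the probabilities are tabulated.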
Table  B.11
Values  of  Tanh  z'  =  r  (Fisher's  transformation) 


z' 

0 

1 

2 

3 

4 

5

6 

7 

8 

9 

.00 

.0000 

0010 

0020 

0030 

0040 

0050 

0060 

0070 

0080 

0090 

.01 

.0100 

0110 

0120 

0130 

0140 

0150 

0160 

0170 

0180 

0190 

.02

.0200 

0210 

0220 

0230 

0240 

0250 

0260 

0270 

0280 

0290 

.03 

.0300 

0310 

0320 

0330 

0340 

0350 

0360 

0370 

0380 

0390 

.04 

.0400 

0410 

0420 

0430 

0440 

0450 

0460 

0470 

0480 

0490 

.05 

.0500 

0510 

0520 

0530 

0539 

0549 

0559 

0569 

0579 

0589 

.06 

.0599 

0609 

0619 

0629 

0639 

0649 

0659 

0669 

0679 

0689 

.07 

.0699 

0709 

0719 

0729 

0739 

0749 

0759 

0768 

0778 

0788 

.08 

.0798 

0808 

0818 

0828 

0838 

0848 

0858 

0868 

0878 

0888 

.09 

.0898 

0907 

0917 

0927 

0937 

0947 

0957 

0967 

0977 

0987 

.10 

.0997 

1007 

1016 

1026 

1036 

1046 

1056 

1066 

1076 

1086 

.11 

.1096 

1105 

1115 

1125 

1135 

1145 

1155 

1165 

1175 

1184 

.12 

.1194 

1204 

1214 

1224 

1234 

1244 

1253 

1263 

1273 

1283 

.13 

.1293 

1303 

1312 

1322 

1332 

1342 

1352 

1361 

1371 

1381 

.14 

.1391 

1401 

1411 

1420 

1430 

1440 

1450 

1460 

1469 

1479 

.15 

.1489 

1499 

1508 

1518 

1528 

1538 

1547 

1557 

1567 

1577 

.16 

.1586 

1596 

1606 

1616 

1625 

1635 

1645 

1655 

1664 

1674 

.17 

.1684 

1694 

1703 

1713 

1723 

1732 

1742 

1752 

1761 

1771 

.18 

.1781 

1790 

1800 

1810 

1820 

1829 

1839 

1849 

1858 

1868 

.19 

.1877 

1887 

1897 

1906 

1916 

1926 

1935 

1945 

1955 

1964 

.20 

.1974 

1983 

1993 

2003 

2012 

2022 

2031 

2041 

2051 

2060 

.21 

.2070 

2079 

2089 

2098 

2108 

2117 

2127 

2137 

2146 

2156 

.22 

.2165 

2175 

2184 

2194 

2203 

2213 

2222 

2232 

2241 

2251 

.23 

.2260 

2270 

2279 

2289 

2298 

2308 

2317 

2327 

2336 

2346 

.24 

.2355 

2364 

2374 

2383 

2393 

2402 

2412 

2421 

2430 

2440 

.25 

.2449 

2459 

2468 

2477 

2487 

2496 

2506 

2515 

2524 

2534 

.26 

.2543 

2552 

2562 

2571 

2580 

2590 

2599 

2608 

2618 

2627 

.27 

.2636 

2646 

2655 

2664 

2673

2683 

2692 

2701 

2711 

2720 

.28 

.2729 

2738 

2748 

2757 

2766 

2775 

2784 

2794 

2803 

2812 

.29 

.2821 

2831 

2840 

2849 

2858 

2867 

2876 

2886 

2895 

2904 

.30 

.2913 

2922 

2931 

2941 

2950 

2959 

2968 

2977 

2986

2995 

.31 

.3004 

3013 

3023 

3032 

3041 

3050 

3059 

3068 

3077 

3086 

.32 

.3095 

3104 

3113 

3122 

3131 

3140 

3149 

3158 

3167 

3176 

.33 

.3185 

3194 

3203 

3212 

3221 

3230 

3239 

3248 

3257 

3266 

.34 

.3275 

3284 

3293 

3302 

3310 

3319 

3328 

3337 

3346 

3355 

.35 

.3364 

3373 

3381 

3390 

3399 

3408 

3417 

3426 

3435 

3443 

.36 

.3452 

3461 

3470 

3479 

3487 

3496 

3505 

3514 

3522 

3531 

.37 

.3540 

3549 

3557 

3566 

3575 

3584 

3592 

3601 

3610 

3618 

.38 

.3627 

3636 

3644 

3653 

3662 

3670 

3679 

3688 

3696 

3705 

.39 

.3714 

3722 

3731 

3739 

3748 

3757 

3765 

3774 

3782 

3791 

.40 

.3799 

3808 

3817 

3825 

3834 

3842 

3851 

3859 

3868 

3876 

.41 

.3885 

3893 

3902 

3910 

3919 

3927 

3936 

3944 

3952 

3961 

.42 

.3969 

3978 

3986 

3995 

4003 

4011 

4020 

4028 

4036 

4045 

.43 

.4053 

4062 

4070 

4078 

4087 

4095 

4103 

4112 

4120 

4128 

.44 

.4136 

4145 

4153 

4161 

4170 

4178 

4186 

4194 

4203 

4211 

.45 

.4219 

4227 

4235 

4244 

4252 

4260 

4268 

4276 

4285 

4293 

.46 

.4301 

4309 

4317 

4325 

4333 

4342 

4350 

4358 

4366 

4374 

.47 

.4382 

4390 

4398 

4406 

4414 

4422 

4430 

4438 

4446 

4454 

.48 

.4462 

4470 

4478 

4486 

4494 

4502 

4510 

4518 

4526 

4534 

.49 

.4542 

4550 

4558 

4566 

4574 

4582 

4590 

4598 

4605 

4613 

♦Reproduced  from  Numerical  Tables,  by  J.  W.  Campbell,  University  of  Alberta,  by  kind 
permission  of  Mrs.  Campbell. 
Table B.11 (cont.)
Values of Tanh z' = r (Fisher's transformation)

z'         0     1     2     3     4     5     6     7     8     9

.50    .4621  4629  4637  4645  4653  4660  4668  4676  4684  4692
.51    .4699  4707  4715  4723  4731  4738  4746  4754  4762  4769
.52    .4777  4785  4792  4800  4808  4815  4823  4831  4839  4846
.53    .4854  4861  4869  4877  4884  4892  4900  4907  4915  4922
.54    .4930  4937  4945  4953  4960  4968  4975  4983  4990  4998
.55    .5005  5013  5020  5028  5035  5043  5050  5057  5065  5072
.56    .5080  5087  5095  5102  5109  5117  5124  5132  5139  5146
.57    .5154  5161  5168  5176  5183  5190  5198  5205  5212  5219
.58    .5227  5234  5241  5248  5256  5263  5270  5277  5285  5292
.59    .5299  5306  5313  5320  5328  5335  5342  5349  5356  5363
.60    .5370  5378  5385  5392  5399  5406  5413  5420  5427  5434
.61    .5441  5448  5455  5462  5469  5476  5483  5490  5497  5504
.62    .5511  5518  5525  5532  5539  5546  5553  5560  5567  5574
.63    .5581  5587  5594  5601  5608  5615  5622  5629  5635  5642
.64    .5649  5656  5663  5669  5676  5683  5690  5696  5703  5710
.65    .5717  5723  5730  5737  5744  5750  5757  5764  5770  5777
.66    .5784  5790  5797  5804  5810  5817  5823  5830  5837  5843
.67    .5850  5856  5863  5869  5876  5883  5889  5896  5902  5909
.68    .5915  5922  5928  5935  5941  5948  5954  5961  5967  5973
.69    .5980  5986  5993  5999  6005  6012  6018  6025  6031  6037
.70    .6044  6050  6056  6063  6069  6075  6082  6088  6094  6100
.71    .6107  6113  6119  6126  6132  6138  6144  6150  6157  6163
.72    .6169  6175  6181  6188  6194  6200  6206  6212  6218  6225
.73    .6231  6237  6243  6249  6255  6261  6267  6273  6279  6285
.74    .6291  6297  6304  6310  6316  6322  6328  6334  6340  6346
.75    .6351  6357  6363  6369  6375  6381  6387  6393  6399  6405
.76    .6411  6417  6423  6428  6434  6440  6446  6452  6458  6463
.77    .6469  6475  6481  6487  6492  6498  6504  6510  6516  6521
.78    .6527  6533  6539  6544  6550  6556  6561  6567  6573  6578
.79    .6584  6590  6595  6601  6607  6612  6618  6624  6629  6635
.80    .6640  6646  6652  6657  6663  6668  6674  6679  6685  6690
.81    .6696  6701  6707  6712  6718  6723  6729  6734  6740  6745
.82    .6751  6756  6762  6767  6772  6778  6783  6789  6794  6799
.83    .6805  6810  6815  6821  6826  6832  6837  6842  6847  6853
.84    .6858  6863  6869  6874  6879  6884  6890  6895  6900  6905
.85    .6911  6916  6921  6926  6932  6937  6942  6947  6952  6957
.86    .6963  6968  6973  6978  6983  6988  6993  6998  7004  7009
.87    .7014  7019  7024  7029  7034  7039  7044  7049  7054  7059
.88    .7064  7069  7074  7079  7084  7089  7094  7099  7104  7109
.89    .7114  7119  7124  7129  7134  7139  7143  7148  7153  7158
.90    .7163  7168  7173  7178  7182  7187  7192  7197  7202  7207
.91    .7211  7216  7221  7226  7230  7235  7240  7245  7249  7254
.92    .7259  7264  7268  7273  7278  7283  7287  7292  7297  7301
.93    .7306  7311  7315  7320  7325  7329  7334  7338  7343  7348
.94    .7352  7357  7361  7366  7371  7375  7380  7384  7389  7393
.95    .7398  7402  7407  7411  7416  7420  7425  7429  7434  7438
.96    .7443  7447  7452  7456  7461  7465  7469  7474  7478  7483
.97    .7487  7491  7496  7500  7505  7509  7513  7518  7522  7526
.98    .7531  7535  7539  7544  7548  7552  7557  7561  7565  7569
.99    .7574  7578  7582  7586  7591  7595  7599  7603  7608  7612
1.00   .7616  7620  7624  7629  7633  7637  7641  7645  7649  7653
1.01   .7658  7662  7666  7670  7674  7678  7682  7686  7691  7695
1.02   .7699  7703  7707  7711  7715  7719  7723  7727  7731  7735
1.03   .7739  7743  7747  7751  7755  7759  7763  7767  7771  7775
1.04   .7779  7783  7787  7791  7795  7799  7802  7806  7810  7814
1.05   .7818  7822  7826  7830  7834  7837  7841  7845  7849  7853
1.06   .7857  7860  7864  7868  7872  7876  7879  7883  7887  7891
1.07   .7895  7898  7902  7906  7910  7913  7917  7921  7925  7928
1.08   .7932  7936  7939  7943  7947  7950  7954  7958  7961  7965
1.09   .7969  7972  7976  7980  7983  7987  7991  7994  7998  8001
1.10   .8005  8009  8012  8016  8019  8023  8026  8030  8034  8037
1.11   .8041  8044  8048  8051  8055  8058  8062  8065  8069  8072
1.12   .8076  8079  8083  8086  8090  8093  8096  8100  8103  8107
1.13   .8110  8114  8117  8120  8124  8127  8131  8134  8137  8141
1.14   .8144  8148  8151  8154  8158  8161  8164  8168  8171  8174
1.15   .8178  8181  8184  8187  8191  8194  8197  8201  8204  8207
1.16   .8210  8214  8217  8220  8223  8227  8230  8233  8236  8240
1.17   .8243  8246  8249  8252  8256  8259  8262  8265  8268  8271
1.18   .8275  8278  8281  8284  8287  8290  8293  8296  8300  8303
1.19   .8306  8309  8312  8315  8318  8321  8324  8327  8330  8333
1.20   .8337  8340  8343  8346  8349  8352  8355  8358  8361  8364
1.21   .8367  8370  8373  8376  8379  8382  8385  8388  8391  8394
1.22   .8397  8399  8402  8405  8408  8411  8414  8417  8420  8423
1.23   .8426  8429  8432  8434  8437  8440  8443  8446  8449  8452
1.24   .8455  8457  8460  8463  8466  8469  8472  8474  8477  8480
1.25   .8483  8486  8488  8491  8494  8497  8500  8502  8505  8508
1.26   .8511  8513  8516  8519  8522  8524  8527  8530  8533  8535
1.27   .8538  8541  8543  8546  8549  8551  8554  8557  8560  8562
1.28   .8565  8568  8570  8573  8575  8578  8581  8583  8586  8589
1.29   .8591  8594  8596  8599  8602  8604  8607  8609  8612  8615
1.30   .8617  8620  8622  8625  8627  8630  8633  8635  8638  8640
1.31   .8643  8645  8648  8650  8653  8655  8658  8660  8663  8665
1.32   .8668  8670  8673  8675  8678  8680  8683  8685  8688  8690
1.33   .8692  8695  8697  8700  8702  8705  8707  8709  8712  8714
1.34   .8717  8719  8722  8724  8726  8729  8731  8733  8736  8738
1.35   .8741  8743  8745  8748  8750  8752  8755  8757  8759  8762
1.36   .8764  8766  8769  8771  8773  8775  8778  8780  8782  8785
1.37   .8787  8789  8791  8794  8796  8798  8801  8803  8805  8807
1.38   .8810  8812  8814  8816  8818  8821  8823  8825  8827  8830
1.39   .8832  8834  8836  8838  8840  8843  8845  8847  8849  8851
1.40   .8854  8856  8858  8860  8862  8864  8866  8869  8871  8873
1.41   .8875  8877  8879  8881  8883  8886  8888  8890  8892  8894
1.42   .8896  8898  8900  8902  8904  8906  8908  8911  8913  8915
1.43   .8917  8919  8921  8923  8925  8927  8929  8931  8933  8935
1.44   .8937  8939  8941  8943  8945  8947  8949  8951  8953  8955
1.45   .8957  8959  8961  8963  8965  8967  8969  8971  8973  8975
1.46   .8977  8978  8980  8982  8984  8986  8988  8990  8992  8994
1.47   .8996  8998  9000  9001  9003  9005  9007  9009  9011  9013
1.48   .9015  9017  9018  9020  9022  9024  9026  9028  9030  9031
1.49   .9033  9035  9037  9039  9041  9042  9044  9046  9048  9050
1.50   .9051  9053  9055  9057  9059  9060  9062  9064  9066  9068
1.51   .9069  9071  9073  9075  9076  9078  9080  9082  9083  9085
1.52   .9087  9089  9090  9092  9094  9096  9097  9099  9101  9103
1.53   .9104  9106  9108  9109  9111  9113  9114  9116  9118  9120
1.54   .9121  9123  9125  9126  9128  9130  9131  9133  9135  9136
1.55   .9138  9140  9141  9143  9144  9146  9148  9149  9151  9153
1.56   .9154  9156  9157  9159  9161  9162  9164  9165  9167  9169
1.57   .9170  9172  9173  9175  9177  9178  9180  9181  9183  9184
1.58   .9186  9188  9189  9191  9192  9194  9195  9197  9198  9200
1.59   .9201  9203  9205  9206  9208  9209  9211  9212  9214  9215
1.60   .9217  9218  9220  9221  9223  9224  9226  9227  9229  9230
1.61   .9232  9233  9235  9236  9237  9239  9240  9242  9243  9245
1.62   .9246  9248  9249  9251  9252  9253  9255  9256  9258  9259
1.63   .9261  9262  9263  9265  9266  9268  9269  9271  9272  9273
1.64   .9275  9276  9278  9279  9280  9282  9283  9284  9286  9287
1.65   .9289  9290  9291  9293  9294  9295  9297  9298  9299  9301
1.66   .9302  9304  9305  9306  9308  9309  9310  9312  9313  9314
1.67   .9316  9317  9318  9319  9321  9322  9323  9325  9326  9327
1.68   .9329  9330  9331  9332  9334  9335  9336  9338  9339  9340
1.69   .9341  9343  9344  9345  9347  9348  9349  9350  9352  9353
1.70   .9354  9355  9357  9358  9359  9360  9362  9363  9364  9365
1.71   .9366  9368  9369  9370  9371  9373  9374  9375  9376  9377
1.72   .9379  9380  9381  9382  9383  9385  9386  9387  9388  9389
1.73   .9391  9392  9393  9394  9395  9396  9398  9399  9400  9401
1.74   .9402  9403  9405  9406  9407  9408  9409  9410  9411  9413
1.75   .9414  9415  9416  9417  9418  9419  9421  9422  9423  9424
1.76   .9425  9426  9427  9428  9429  9431  9432  9433  9434  9435
1.77   .9436  9437  9438  9439  9440  9442  9443  9444  9445  9446
1.78   .9447  9448  9449  9450  9451  9452  9453  9454  9455  9457
1.79   .9458  9459  9460  9461  9462  9463  9464  9465  9466  9467
1.8    .9468  9478  9488  9498  9508  9517  9527  9536  9545  9554
1.9    .9562  9571  9579  9587  9595  9603  9611  9618  9626  9633
2.0    .9640  9647  9654  9661  9667  9674  9680  9687  9693  9699
2.1    .9705  9710  9716  9721  9727  9732  9737  9743  9748  9753
2.2    .9757  9762  9767  9771  9776  9780  9785  9789  9793  9797
2.3    .9801  9805  9809  9812  9816  9820  9823  9827  9830  9833
2.4    .9837  9840  9843  9846  9849  9852  9855  9858  9861  9863
2.5    .9866  9869  9871  9874  9876  9879  9881  9884  9886  9888
2.6    .9890  9892  9895  9897  9899  9901  9903  9905  9906  9908
2.7    .9910  9912  9914  9915  9917  9919  9920  9922  9923  9925
2.8    .9926  9928  9929  9931  9932  9933  9935  9936  9937  9938
2.9    .9940  9941  9942  9943  9944  9945  9946  9947  9949  9950
3.     .9951  9959  9967  9973  9978  9982  9985  9988  9990  9992
4.     .9993  9995  9996  9996  9997  9998  9998  9998  9999  9999
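
Entries of Table B.11 can be checked, and values between tabulated arguments obtained, by direct computation of r = tanh z'. The following is a minimal sketch in Python, assuming only the standard `math` module; the function names `fisher_r` and `fisher_z` are illustrative and do not come from the text.

```python
import math


def fisher_r(z):
    """Return r = tanh(z'), the quantity tabulated in Table B.11."""
    return math.tanh(z)


def fisher_z(r):
    """Return z' = atanh(r) = 0.5 * ln((1 + r) / (1 - r)), valid for |r| < 1."""
    return math.atanh(r)


# Spot-check a few arguments; the table lists .4621, .7616, .9051 and .9640.
for z in (0.50, 1.00, 1.50, 2.00):
    print(f"z' = {z:4.2f}   r = {fisher_r(z):.4f}")
```

Rounded to four decimals, the printed values agree with the tabulated entries, and `fisher_z` recovers z' from a given r without interpolating in the table.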

ANSWERS  TO  PROBLEMS 

Chapter  1 

A      fh  80  -i-  (3)  ±  £  JL-  (d\  ±-  n\  _i_  J.  _i l.  i   jl 

B. (1) 3125, 1024; (2) 120; (3) 360; (4) 480; (8) 20; (9) $(4n^2 - 6n + 4)$

C. (1) 0.055; (2) A; (3) *; (4) ff)/(T)  =  ^   (?)(?)/
(1(J°)  =0.0423;   (5)   0.638;   (6)   0.665;   (7)   |f;   (10)    1  -  Jj.a  (-1)*/*!; 

(11)  Efc2o(-l)fc(^)(50  -  fc)!/52!  =  0.548; (12) $e^{-1}$ = 0.368, 0; (13) 0.096,

0.497,  0.407;  (14)  ■*■*-;  (15) ? . 

D. (1) 37.5c; (2)   $  1-^V (3)  (l/2)*+1,  1; (4) (1 - p)/p; (5) 13; (6) 7n/2;
(7)  $2.67;  (8)  $6;  (9)  (a)  2  (b)  43  (c)  223  (d)  103  (e)  365  (f)  264. 

E.  (1)  i;  (2)  0.326;  (3)  ±;  (5)  J;  (6)  2K/3. 


Chapter  2 

A. (3) 0.683; (4) M = 7.26%, $Q_1$ = 6.39%, $Q_3$ = 8.22%, P.R. = 79.6;
(5) 19.5d., 7.8d., 74.3; (6) 32, 5.2; (8) -0.104, 1.861, 0.111-, x = 7.35, $k_2$ = 1.86,
$m_3/m_2^{3/2}$ = 0.27.

B. (1) ^ 2.16, 0.88, -0.14; (2) ^, 2.184; (3) $16/\pi$, |, <£_; (4) ±, 3, 3,
$-3\log(1 - h)$; (5) $(e^h - 1)^2/h^2$; (6) $N\log[1 + p(e^h - 1)]$, $Np$, $Npq$,
$Npq(q - p)$, $Npq(1 - 6pq)$; (10) $a\beta/(a - 1)$, $a\beta^2/[(a - 1)^2(a - 2)]$; (11) $(b)^{-1/a}$,
$[(a - 1)/\{b(a + 1)\}]^{1/a}$; (12) 2; (13) $\theta = (1 + x)^{-1}$.

C. (1) |, i; (2) |; (3) (a) 0, (b) $e^{-3}$ = 0.05, Chebyshev inequalities (a)
P  <  A,  (b)  P  <  i. 

Chapter  3 

A. (1) 0.774; (2) (a) f§i, (b) £J; (4) 0.063; (5) 0.468; (6) n > 69; (8) $C(n_1, n_2) = -n\theta(1 - \theta)$,
$V(n_1/n - n_2/n) = 4\theta(1 - \theta)/n$.

B. (1) ^; (2) AA > 6250; (5) E(X) = f, V(X) = ^.

C. (1) $2e^{-2}$ = 0.271; (2) 2.303; (3) 0.423; (4) P(11, 5) = 0.0137; (5) 0.0915,
0.216;  (6)  (a)  0.143,  (b)  0.053,  (c)  6  or  more. 

D.  (1)  0.0863,  0.3251,  0.9808,  0.0515 ;  (2)  (a)  1.1505,  (b)  -0.1764;  (3)  0.586; 
(4)  (a)  75.8,  (b)  1037;  (5)  20;  (6)  793,  4207;  (7)  74.3,  3.23;  (8)  125;  (10)  0.219, 
0.200,  0.214;  (11)  0.217,  0.625. 

E.  (3)  V(X)=  14.2,  14.3,  44.7,  47.5,  102.0,  V(A)  =  9.73,  11.38,  24.57, 
22.02,  38.82. 
Chapter 4

A. (1) i, A, 0, -t^; (2) $Y = (X - 1)^2/4$; (3) $e^{-u}$, $0 < u < \infty$; (4)
$f(v)/(2av)$, $v = [(u - b)/a]^{1/2}$; (5) $g(u) = 2u^{-1/2}/9$ $(0 < u < 1)$, $g(u) = (1 + u^{-1/2})/9$
$(1 < u < 4)$; (6) (a) $F(\log x)$, $0 < x < \infty$, (b) $\sum_n [F(2n\pi + \sin^{-1}x) -
F((2n - 1)\pi - \sin^{-1}x)]$, $-1 < x < 1$, (c) $x$, $0 < x < 1$.

B. (3) (a) (|) $\Gamma$ (|), (b) $e^6\,\Gamma(8)/(3 \cdot 6^7)$; (4) [B& (N - 3)/2)]^{-1}; (9) $(m - 1)/n$.

C. (5) $E(X) = e^{1/2}$, $V(X) = e(e - 1)$; (6) 0.2404.

Chapter  5 

A. (1) $t_1$ = 0.0510, $t_2$ = 1.5530, 90% limits 0.026 to 0.784; (2) $t_1$ = 0.736,
$t_2$ = 1.264, 95% limits 0.936 to 1.464; (3) $X \pm 2.33\sigma N^{-1/2}$; (8) 5 to 25.3,
compared with 5 to 39.2 from problem (7); (9) (b) r ~ |W(1 - 9)

B. (2) 0.725, 0.753, 0.613, -0.046; (3) 0.061, 0.074, $C(k_1, k_2)$ = 0.0031;
(4) 0.046, 0.091.

C. (1) No, $P(|X - 20| > 1.8)$ = 0.027; (2) 12.28 to 12.38 sec, 17; (3) Yes,
$P(|x_2 - x_1| > 5)$ = 0.009; (4) 0.08 to 0.17, 0.087 to 0.175; (5) No, P = 0.17;
(6)  0.38  to  0.78;  (7)  0.06  to  0.27;  (8)  hypothesis  not  rejected,  P  =  0.11;  (9)  62 
or  more. 

Chapter  6 

A. (8) $V(s)/V(d\sqrt{\pi/2})$ = 0.876; (11) $\theta = 2(m - 1)/m$;
(12) $N\rho(1 - \rho^2) - \rho(\sum x^2 + \sum y^2) + (1 + \rho^2)\sum xy = 0$.

B. (1) 191; (2) reject $H_0$ if $m > 75 + 2.79N^{-1/2}$, 0.14, 0.57, 0.92, 23;

(3) reject Ht if m > k, where k is given by $P(10k, 10) < 0.05$, k = 1.6,
power = P(16, 20) = 0.84; (4) $x > c$, where $\alpha \approx B(c, n, \theta_0)$, $1 - \beta \approx B(c, n, \theta_1)$,
n = 50; (5) $|t| > c$, $t = N^{1/2}(m - \mu_0)/s$, s = sample standard deviation;
(6) $\beta = \sum (x_i - \bar{x})(y_i - \bar{y})/\sum (x_i - \bar{x})^2$, $\alpha = \bar{y} - \beta\bar{x}$,
$N\sigma^2 = \sum \{(y_i - \bar{y}) - \beta(x_i - \bar{x})\}^2$; (7) 6561.

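C. (2) IJ°^a/Zl°«3^  ttt; (3) 0.898; (5) $\alpha(c) = 1 - \Phi(2c)$, $\beta(c) = \Phi(2c - 4)$, 0.46;
(7) accept if $\int_0^1 \theta^r(1 - \theta)^{N-r}\,\xi(\theta)L_1(\theta)\,d\theta < \int_0^1 \theta^r(1 - \theta)^{N-r}\,\xi(\theta)L_2(\theta)\,d\theta$.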

Chapter  7 

A.  (1)  375,  625;  (2)  (a)  20,  23,  19,  17,  8,  6,  7,  (b)  10,  18,  17,  19,  12,  9,  15, 
(a) 180%, (b) 216%; (4) $2 - 2x_0 = (2 - x_0)e^{-x_0}$, $x_0$ = 0.644.

B.  (1)  1.58,  1.54;  (4)  V(T/Nl)  =  1.00. 

C.  (3)  V(mt)  =  11.62,  variance  within  samples  =  161.6,   V(mr)  =  30.66; 

(4)  0.406. 

D.  (3)  V(f)  =  23.45  x  106;  Vr(f)  =  23  J  x  106;  (4)  ^V/Q  =a2k2\C2 
=  Ma2  -  M.a,2  -  M2g22)I{MC0\  N  =  [M2o2  +  MM^2{h  -  1)  +  MM2a22 
(k  -  1)]/(M<72  +  £2); (5) h = 1.23, k = 2.36, N = 487, expected cost = 279,
for  random  sample  N'  =  302,  C  =  319. 

E. (1) 55, 11; (3) 29 for $H_0$, 24 for $H_1$; (7) $E_0(n) = [\alpha \log A + (1 - \alpha)\log B]/[\mu_0 - \mu_1 + \mu_0 \log(\mu_1/\mu_0)]$,
$E_1(n) = [(1 - \beta)\log A + \beta \log B]/[\mu_0 - \mu_1$
Chapter 8

A. (2) 690 or more; (3) 0.308; (4) 62; (6) $K(h) = -(n/2)\log(1 - 2\sigma^2 h/n)$,
$E(k_2) = \sigma^2$, $V(k_2) = 2\sigma^4/n$; (7) n/(n + 2).

B. (1) 55.2 to 59.3; (2) 179.1 to 182.9; (3) Yes, at 5% level, P = 0.04;

(4)  Yes,  P  =  0.10;  (5)  -5.3  to  18.3  mm;  (6)  -3.2  to  7.4  lb;  (7)  No,  t  =  1.56; 
(8) Yes, P = 0.02; (9) (a) increase barely significant, t = 2.11, (b) highly significant,
t = 5.66; (12) 0.69; (13) 8.4; (15) accept lot if m + 3.36s < 3.0,
S  =  6.20,  k  =  3.36. 

C.  (1)  12.6  to  168 ;  (2)  highly  significant,  F  =  23 ;  (3)  No,  F  =  2.0;  (4)  Yes, 
F  =  2.10;  (7)  6.94;  (8)  10.1;  (9)  F  =  1.36,  X  =  1.59. 

D. (1) 276, 281; (2) Yes, t < 0.005; (3) Yes, t < 0.005; (4) (a) significant at
5% level but not at 1%, (b) not at 5%; (5) R = 12.5; & = 4.06, (7) $g(R) =
(N - 1)e^{-R}(1 - e^{-R})^{N-2}$; (8) 0.35 to 4.78; (9) $b(N - 1)/(N + 1)$, $2b^2(N - 1)/[(N + 1)^2(N + 2)]$; (10) 8.

Chapter  9 

A.  (1)  homogeneity  accepted,  M/c  =  4.13;  (4)  highly  significant,  F  =  8.3; 

(5)  No,  F  =  3.33. 

B. (1) variety effect almost sig. at 1% level, F = 3.11, block effect sig. at
5%, F = 2.76; (2) F for fertilizers = 71.4; (3) F(makes) = 34.1, F(cities) = 7.43,
F(interaction) = 23.2; (4) (a) -8.5, 0.5, 0.5, 7.5, (b) -1.07, -0.80, 1.90,
-0.04; (5) (a) 0.002, -0.178, 0.140, 0.120, -0.058, 0.008, 0.100, -0.006,
-0.128,  (b)  -0.086,  0.090,  -0.053,  -0.020,  0.069;  (6)  (a)  0.17,  -1.05,  1.07, 
-1.62, 1.41, (b) 0.37, -0.54, 0.16, (c) $y_{1j}$ = -0.56, 1.62, -1.05, $y_{2j}$ = -0.84,
1.24, -0.40, $y_{3j}$ = 0.54, -0.75, 0.22, $y_{4j}$ = -1.24, 1.31, -0.06, $y_{5j}$ = 2.10,
-3.40, 1.31; (8) d2 = 25.9, a2 = 37.8, correlation coefficient = 0.59, power
$\approx$ 0.28; (9) 6.8, 9.3, 2.6, 4.5.

C. (1) 0.55, 0.195, 3.37, 0.455, F(makes) = 1.47; (3) city effect highly
significant,  F  =  1.56,  box  effect  non-sig.,  F  =  1.2,  a2  =  0.197,  &p2  =  0.0005, 
a2  =  0.019;  (5)  row  and  column  effects  non-sig.,  treatment  effect  highly  sig., 
F  =  11.1. 

D. (1) F(detergents, elim. blocks) = 27, F(blocks, elim. detergents) = 3.95,
both  sig.  at  1%;  (2)  0.307,  B  and  C  only;  (3)  5C,  BD  and  BG;  (4)  S(r  =  4 
almost  sufficient). 

Chapter  10 

A.  (1)  P  =  0.40;  (2)  No,  P  =  0.66;  (3)  not  at  5  %  level,  P  =  0.02;  (4)  Yes, 
P  <  0.01;  (5)  No,  P  =  0.007;  (6)  fit  satisfactory,  P  =  0.14;  (7)  P  =  0.42; 
(8) P = 0.85; (9) P = 0.17; (10) (a) $\chi^2$ = 17.9, P = 0.003, (b) $\chi^2$ = 10.3,
P = 0.035, (c) $\chi^2$ = 7.96, P = 0.046.

B. (1) $D_N$ = 0.549; (2) not quite sig. at 5%; (3) Yes, $N^{1/2}D_N$ = 0.86;
(4) Yes, $\max |S_N(x) - F(x)|$ = 0.092.

C. (1) Yes, P = 0.51; (2) No, T = 18.5; (3) No, $z \approx 0.50$; (4) null
hypothesis rejected at about 2% level; (5) No, P(one-tailed) = 0.26.
D. (1) No; (2) randomness not rejected, P > 0.20; (3) number of runs
= expected number (33), distribution of runs not random by $\chi^2$ test; (4) $z \approx 3.52$,
highly sig.; (5) Yes, S = 51, P < 0.006.

Chapter  11 

A. (1) (a) $g(x) = 2(a - x)/a^2$, $0 < x < a$, $h(y) = 2y/a^2$, $0 < y < a$,
$\eta_x = (a + x)/2$, $\xi_y = y/2$, $\mu_X = a/3$, $\mu_Y = 2a/3$, $\sigma_X^2 = a^2/18$, $\sigma_Y^2 = a^2/18$,
$\sigma_{XY} = a^2/36$, $\rho = 1/2$; (2) necessary but not sufficient; (9) 1/2.

B. (1) $y_c = 0.886x - 0.57$, $x_c = 0.825y + 8.55$; (2) 0.855, 0.62 to 1.15,
0.58 to 1.07; (3) 125, 80, 15.1, 9.05, 75.0, 0.55; (4) $y_c = 0.070x + 58.2$, 0.052
to  0.087,  0.487;  (5)  No,  t  =  1.91;  (6)  20;  (7)  No,  relative  accuracy  =  1.19; 
(8)  64,  4.0;  (9)  16,  13,  17,  0.60. 

C. (1) $y_c = 16.58 - 1.27x$, 1923; (2) 33.9 in., 24.5 to 43.9 (approximation
26.7 to 41.1); (3) $\beta$ = 3.214, $\alpha$ = 6.726, $\sigma^2$ = 0.00062; (4) (a) $y_c = 3.224x + 6.668$,
(b) $y_c = 3.214x + 6.723$, (a) 3.199 to 3.263, (b) 3.182 to 3.246,
$\sigma^2$ = 0.00071.

D.  (1)  Yes,  t  =  -3.43,  No,  P  =  0.07;  (2)  0.319  to  0.774,  No  (P  =  0.11); 
(3) No, P = 0.37; (4) 0.705; (5) $\rho \approx r(1 - 1/(2n))$.

E. (1) $r_S$ = 0.796, $r_K$ = 0.554; (2) $r_P$ = 0.636, $r_S$ = 0.733; (3) hypothesis
of independent random rankings not rejected, P = 0.14; $r_S$ = 0.624, $r_K$ = 0.422,
agreement not sig. at 10% level.

F. (1) $\chi_s^2$ = 76, highly sig., C = 0.32; (2) Yes, P = 0.01; (3) No, (a)
P = 0.24, (b) P = 0.23; (4) No, $P \approx 0.10$; (5) Yes, P < 0.01.

Chapter  12 

A. (3) $y_c = 3.37x + 0.00364z + 9.30$; (4) $y_c = 17.19 - 0.081x - 0.342z$;
(5) $y_c = -1.0051 - 0.000065x_1 + 0.003894x_2 + 0.32505x_3$.

B. (1) rows of $A^{-1}$ are (0.3958, -0.00377, -0.03651), (-0.00377, 0.000286,
-0.000167), (-0.03651, -0.000167, 0.00482), $\sigma^2$ = 2.34; (2) $15.56 < \beta_0 < 18.82$,
$-0.125 < \beta_1 < -0.037$, $-0.522 < \beta_2 < -0.163$; (3) a11 = 1.134, a22 =
-0.0000436, a33 = 0.0000633, a44 = 0.0355, standard errors of $b_1$, $b_2$,
$b_3$ = 0.0032, 0.0039, 0.092; (4) rows of A are (20, 98.2, 11,880), (98.2, 506.4,
57,284), (11,880, 57,284, 7,201,220), $\sigma^2$ = 8.235,
$V(y) = \sigma^2[8.62 - 1.112x - 0.0163z + 0.000874xz + 0.0603x^2 + 0.0000101z^2]$.

C. (1) $y_c = 0.778 + 0.557x + 0.1857x^2$; (2) not at 5% level, F = 4.4 with
1 and 4 d.f.; $r_c$ = 0.9974, r = 0.9854; (4) $y_c = 196.54 + 2.918x - 0.0698x^2 + 0.000299x^3$;
(5) cubic regression not sig., standard errors 1.56, 0.15, 0.0435,
0.0435.

D. (1) $r^2$ = 0.481, $E^2$ = 0.584, F = 2.13, linearity acceptable; (2) $E_{yx}^2$ =
0.287, $E_{xy}^2$ = 0.271, No, F = 1.51 and 1.32.

E. (1) (a) 5y' + x - 15 = 0, (b) $y = 1000e^{-0.461x}$; (2) $100y = e^{2.3x} = 10^x$;
(3) (a) yc = 0.643e1043x, yc = 0.589^1068x; (5) a = 0.509, b = -2.036;
(6) a = 200, b = 4.12, q = 0.733, $y_c = a[1 + bq^u]^{-1}$, $u = (x - 1870)/10$;
(8) a = 0.768, b = 3.86.
A. (1) 0.67; (2) $r_{01.2}$ = 0.759, $r_{02.1}$ = 0.097, $r_{12.0}$ = -0.436, $r_{0.12}$ = 0.802,
Yes, F = 15.3; (3) $r_{0.12}$ = 1.29 (impossible); (4) $r_{02.1}$ = 0.707, $r_{12.0}$ = -0.715;
(5) $r_{01}$ = 0.187, $r_{02}$ = 0.165, $r_{03}$ = 0.468, $r_{12}$ = 0.732, $r_{13}$ = 0.201, $r_{23}$ = 0.083,
$r_{01.23}$ = 0.0035, $r_{02.13}$ = 0.095, $r_{03.21}$ = 0.454, $r_{0.123}$ = 0.485; (6) weighted
mean = -0.8975.

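B. (1) $f(x, y, z) = (2\pi)^{-3/2}[1 - \rho_{xy}^2 - \rho_{xz}^2 - \rho_{yz}^2 + 2\rho_{xy}\rho_{yz}\rho_{xz}]^{-1/2}\exp(-Q/2)$,
$Q = [x^2(1 - \rho_{yz}^2) + y^2(1 - \rho_{xz}^2) + z^2(1 - \rho_{xy}^2) + 2xy(\rho_{xz}\rho_{yz} - \rho_{xy}) + 2xz(\rho_{xy}\rho_{yz} - \rho_{xz}) + 2yz(\rho_{xy}\rho_{xz} - \rho_{yz})]/[1 - \rho_{xy}^2 - \rho_{xz}^2 - \rho_{yz}^2 + 2\rho_{xy}\rho_{xz}\rho_{yz}]$;
(4) $f(x_2|x_1) = (2\pi)^{-1/2}[C_{11}/(C_{11}C_{22} - C_{12}^2)]^{1/2}\exp(-Q/2)$, $Q = (C_{11}x_2 - C_{12}x_1)^2/[C_{11}(C_{11}C_{22} - C_{12}^2)]$;
(5) $[(C_{12}C_{33} - C_{13}C_{23})x_2 + (C_{22}C_{13} - C_{12}C_{23})x_3]/[C_{22}C_{33} - C_{23}^2]$; $T^2$ = 20.5, hypothesis rejected at 1% level.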

C. (1) $L = 0.890x_1 + 3.95x_2$, or $L = 9x_1 + 40x_2$ approx.; (2) $L = -x_1 + 40x_2 - 22x_3$,
$L_1$ = 1130, $L_2$ = 2123, L = 1487; (3) $L = -0.0312x_1 - 0.1839x_2 + 0.2221x_3 + 0.3147x_4$,
$L_1$ = 0.669, $L_2$ = -0.384, criterion L < 0.142 for (1);
(4) T22 = 26.34, highly sig.

D-    (1)  (oS  0S2)' [a4286  a5714];  (2)  {h  h  ^  A)'  ih  *'  A'  A)' 

(h  h  A,  tVX  matrix not regular; (3) (a) (0.4, 0.6), (b) (f , f, f); (4) $P(j \to j) = 2j(n - j)/n^2$,
$P(j \to j - 1) = j^2/n^2$, $P(j \to j + 1) = (n - j)^2/n^2$, $0 < j < n$,
expected number of black balls in (1) = J; (6) 0.55.
Aitchison, J., 94
Aitken, A. C., 406
Angular transformation, 70, 117
Arithmetic mean, see Mean
criterion, 142, 145

rule,  144,  149 
distribution, see Binomial

law  of  large  numbers,  56,  68 

numbers,  107,  395 
Bertrand,  J.,  20,  27 
Beta  distribution,  83,  205 
Beta function, complete, 84, 94, 389

incomplete,  85,  196,  245 
Beta-prime  distribution,  83,  194,  366 
Betting  odds,  9 
Bias  (estimator),  128 
Bienayme  theorem,  43 

-Chebyshev  inequality,  45 
Binomial  distribution,  52,  55 

approximations,  59,  64,  68 

confidence  limits,  114,  116,  119 

cumulants,  55 

cumulative,  53,  429 

moments,  54 

recursion  formula,  54,  56 
Binomial  graph  paper,  117 
Binomial sequential test, 166
Binomial theorem, 13
Bivariate  normal  distribution,  148,  282, 

284,  285,  364,  380 
Block  effects,  222,  239 
Blocks,  complete,  220,  222,  234,  237 

incomplete,  237,  239 
Brandt-Snedecor  formula,  317 
Bross,  I.  D.  J.,  150 
Brown, J. A. C., 94
Buffon,  G.  L.  (Comte  de),  21,  273 

Camp,  B.  H.,  67 
Campbell,  G.,  51 
Carter,  A.  H.,  211 

Cauchy  distribution,  50,  92,  130 
Cauchy  principal  value,  50,  385,  386 
Central  limit  theorem,  90 
Change  of  variables,  386 
Chapman-Kolmogorov  equation,  373 
Characteristic  function,  43 
Charlier  check,  35 

Gram-Charlier  curves,  90 
Chebyshev  inequality,  45,  56,  68,  101 
Chi-square  distribution,  85,  136,  402 

approximations,  87 

cumulants,  86 

non-central,  136,  244,  347,  372,  397 
theorems  on,  87 
Chi-square  test,  contingency,  315 

goodness  of  fit,  253,  255,  275 

of  hypothesis,  251 

for  contingency  tables,  315,  318 
Choleski  method  (square  root  method), 

330 
Class boundaries, 29

interval,  30 

marks,  29,  34 
Clemm,  D.  S.,  213 
Cluster  sampling,  152,  155 
Cochran,  W.  G.,  88,  94,  173,  208,  213, 

250,  278,  327 
Coding  (of  variate),  34,  105 
Cofactor,  406 
Combinations,  12 
Complement  (set),  4 
Components  of  variance,  225,  227 
Conditional   probability,    10,    11,    134, 
397 

expectation,  156,  233,  398 
Confidence  belt,  97,  258,  296 

coefficient,  96,  97 

limits,  97,   113,   119,   183,   185,  193, 
294,  296,  334 
Conformable  matrices,  403 
Consistency,  101,  126 
Contingency,  314,  317 
Contrasts,  241 

Convergence  (stochastic),  3,  56 
Convolution,  47 
Coolidge,  J.  L.,  27 
Cornish,  E.  A.,  183 
Correlation,  intra-class,  226 

multiple,  357 

ordinary, 44, 281, 287

partial,  361,  363,  376 

rank,  308,  310,  313 

serial, 158, 171, 240
Correlation   coefficient    (Pearson),   44, 
281,  287,  357 

(Kendall),  309,  313 

(Spearman),  309,  314 

distribution  of,  306,  308 

significance  of,  295,  307,  313 
Correlation  index,  352 
Correlation  matrix,  357 


Correlation  ratio,  342 

distribution  of,  345,  347 
Covariance,  44,  226,  280,  377 
Cowden,  D.  J.,  349,  354,  355 
Cox,  G.  M.,  250 
Cramer,  G.  (rule),  330,  408 
Cramer,  H.,  128,  150,  278 

Cramer-Rao  inequality,  128 
Crossing  (factors),  232 
Cumulant  generating  function,  41,  43 
Cumulants,  41,  103,  105 
Cumulative    distribution    function,    see 

Distribution  function 
Cumulative  frequency,  29,  31 
Curtiss,  J.  H.,  51,  77 
Curve-fitting,  32,  90,  335 

goodness  of  fit,  253,  255 
Davis, R. L., 27

Deciles,  33 

Degrees  of  freedom,  86,  180,  194,  218, 

252, 316
Delta  process,  72 
Density  (probability),  19,  46 
Design  of  experiments,  28,  219,  232 

balanced,  237 

complete,  220,  234,  237 

efficiency  factor,  239 

incomplete,  237 
Determinants,  393,  406 

functional,  see  Jacobian 
Deviation,  mean  absolute,  147 
Differentiation  under  integral  sign,  392 
Digamma  function,  147 
Discriminant  function,  366,  369 
Disjoint  (sets),  5 
Dispersion,  measures  of,  33 
Distance  (between  populations),  371 

studentized,  372 
Distribution  function,  20,  79,  256 

joint,  46,  48,  178 

of  sum  of  variates,  47 
Distribution-free  methods,  251-273 
Distributions,  special,  see  under  separate 

names 
Dixon, W. J., 199, 213
Domain,  15,  17 
Doob,  J.  L.,  380 
Double sampling, 158
Dwyer,  P.  S.,  354 

e,  381 

Efficient  estimator,  101,  126,  206 

Elkin,  J.  M.,  51 

Equations,  normal,  286,  329,  330,  410 

Equivalence  (events),  5 

Erdelyi,  A.,  380 

Ergodic  chain,  380 

Errors,  286,  292,  299 

curve  of,  64 

of  first  kind,  131,  132,  163 

of  second  kind,  131,  163 

true,  332 
Estimate,  standard  error  of,  294 
Estimation,  123-147 

components  of  variance,  225 

contrasts,  241 

interval,  96 

point,  96 

regression  coefficients,  297,  303 

treatment  effects,  223 

variance  from  range,  202 

X  for  given  Y,  296,  324 
Estimators,  97,  130 

consistent,  101,  126,  300,  303 

efficient,  101,  126,  130,  206 

invariant,  127 

least  squares,  208 

maximum  likelihood,  51,   123,    125, 
129,  215,  297 

most  efficient,  101,  126 

sufficient,  125,  127 

unbiased,   101,   105,   127,   153,  215, 
334 
Euler,  L.,  4 
Events,  3 

complementary,  4 

compound,  3 

independent,  11 

mutually  exclusive,  5 

simple,  3 
Expectation,  17,  102 

conditional,  156,  233,  279,  398 
Exponential  distribution,  80,  206,  212 
Exponential  function,  382 
Exponential  regression,  347 
Extreme  values,  distribution,  198,  206 

rejection  of,  199 


Factorial,  12,  388 

moment  generating  function,  41 

moments,  41,  58 

Stirling  approximation,  383 
Factors  (design),  232 
F-distribution,  193,  196,  344,  360 
Feller,  W.,  12,  27,  257,  278 
Fiducial  inference,  99 

interval,  96,  120 
Finite  population,  101,  117 
Fisher,  Sir  Ronald  A.,  41,  94,  122,  213, 
250,  327,  380 

analysis  of  variance,  241 

angular  transformation,  70,  117 

approximation  to  χ²,  87 

approximation  to  t,  183 

discriminant  function,  369,  379 

distribution  of  F,  194,  211 

distribution  of  r,  306,  360 

distribution  of  t,  180 

exact  test,  319 

extreme  values,  207 

fiducial  inference,  96 

g-statistics,  197 

inequality,  128 

k-statistics,  103 

maximum  likelihood,  123 

theorem  (chi-square),  88,  293 

z'-transformation,  307,  361 
Fisher  and  Yates  (tables),  77,  194,  213, 

235,  250,  340 
Fix,  E.,  150,  397 
Fourier  transform,  43 
Fractiles,  33 
Fraser,  D.  A.  S.,  28,  51 
Freeman,  G.  H.,  327 
Freeman,  M.  F.,  67,  77 
Frequency  curve,  31 

distribution,  28,  29,  289 

polygon,  30 
Frequency,  relative,  31,  52 
Function,  15 

characteristic,  43 

indicator,  16,  18,  54 

orthogonal,  87 
Functional  relation,  298 

Games,  theory  of,  147 

Gamma  function,  complete,  82,  83,  388 

incomplete,  83 
Gamma  distribution,  81,  128 
Gauss,  C.  F.,  64 
Geiger,  H.,  253 
Geometric  distribution,  51 
Gibson,  W.  M.,  327 
Glover,  J.  W.,  14,  27,  54,  77 
Gnedenko,  B.  V.,  94 
Gompertz  curve,  354 
Goodness  of  fit,  253 
Gosset,  W.  S.,  see  Student 
Gram-Charlier  system,  90 

approximation,  61 
Grouping,  error  of,  34,  107,  395 

method  of,  303 

of  frequencies,  see  Pooling 
g-statistics,  197 
Gumbel,  E.  J.,  207,  213 

Hald,  A.,  213,  250 
Halton,  J.  H.,  327 
Hansen,  M.  H.,  173 
Harmonic  mean,  93 

interpolation,  195 
Harter,  H.  L.,  213 
Hartley,  H.  O.,  27,  77,  94,  201,  213, 

216,  250 
Hendricks,  W.  A.,  182,  213 
Hierarchal  model,  see  Nested  model 
Hilferty,  M.  M.,  87,  94 
Histogram,  30 
Homogeneity  (of  variance),  214,  241 

chi-square  test  of,  321 
Homoscedasticity,  284,  292 
Hotelling,  Harold,  327,  346,  354,  380 

generalized  T-test,  365,  369 

distribution  of  r,  306 
Householder,  A.  S.,  354 
Houseman,  E.  S.,  340,  352,  354 
Hurwitz,  W.  N.,  173 
Hypergeometric  distribution,  58 

confluent  function,  245,  346 

function,  58,  360 

sampling,  57,  117 
Hypothesis,  alternative,  132,  162,  214 

composite,  132 

null,  131,  162,  214,  397 

one-sided,  132 

simple,  132,  162 

test  of,  131,  397 

Implication,  5 
Improper  integrals,  385 


Independence,  of  errors,  240 

of  events,  11,  19 

of  variates,  19,  176 
Indicator  function,  16,  18,  54 
Inefficient  estimator,  correction  of,  130 
Interaction  (events),  5 

(analysis  of  variance),  220,  222,  227, 
237 
Interquartile  range,  33 
Interval,  class,  30 
Interval  estimation,  96 
Intra-class  correlation,  226 
Invariance  (estimator),  127 

Jacobian,  176,  387,  394,  400 
Jeffreys,  H.,  27 
Johnson,  L.  P.  V.,  250 
Johnson,  P.  O.,  351,  378,  380 
Jonckheere,  A.  R.,  270,  271,  278,  433 
Jordan's    method    (matrix    inversion), 

409 
Jowett,  G.  H.,  327 

Kaplansky,  I.,  51 

Kapur,  J.  N.,  327 

Keeping,  E.  S.,  51,  150,  250 

Kelley,  T.  L.,  77 

Kendall,  M.  G.,  27,  94,  150,  213,  309, 

327 
Kenney,  J.  F.,  51 
Kitagawa,  T.,  61,  77 
Kolmogorov,  A.  N.,  94,  256 

test,  256,  258 
Kolmogorov-Smirnov  (two-sample)  test, 

259 
Kronecker  delta,  406 
k-statistics,  36,  103,  105,  394 

covariance  of,  110 

generalized,  104 

standard  errors  of,  109 
Kurtosis,  37,  42,  55,  61,  202,  207,  256 

variance  of,  197 

Lagrange  multiplier,  161,  187,  315,  400 
Laplace,  P.  S.,  144 

distribution,  50 
Large  numbers,  law  of,  3,  56 
Latin  square,  235 

Least  squares,  method,  286,  300,  347, 
349 

weighted,  348 
Leibniz  formula,  393 
Lev,  J.,  278 

Levene,  H.,  268,  270,  278 
Lieberman,  G.  J.,  190,  213 
Likelihood,  123,  214,  315 

maximum,  see  Maximum  likelihood 

ratio,  135,  162,  166 
Lindeberg,  J.  W.,  91 
Lindley,  D.  V.,  300,  327 
Linearity,  test  for,  337,  342 
Logarithmic  transformation,  90 
Logistic  curve,  354 
Log-normal  distribution,  89 
Loss  function,  143,  145 
Luce,  R.  D.,  150 
Lyapunov,  A.,  92 

Madansky,  A.,  327 
Madow,  W.  G.,  and  L.  H.,  173 
Mahalanobis,  P.  C.,  371 
Mainland,  D.,  119,  122,  321,  327 
Mann,  H.  B.,  265,  278 

Mann-Whitney  test,  265,  273,  417 
Markov  chain,  373 

regular,  374 

ergodic,  380 
Markov  inequality,  45 

process,  373,  375 
Massey,  F.  J.,  257,  259,  278 
Matching  problem,  25 
Mathematical   model,    217,    220,    235, 

238 
Matrix  algebra,  402-412 
Matrix  addition,  403 

adjoint  of,  407 

diagonal,  405 

division,  407 

equality,  403 

inverse  of,  330,  407,  408 

multiplication,  403 

null  (zero),  403 

of  observations,  328 

orthogonal,  408 

rank  of,  407 

regular,  374 

singular,  364,  407 

skew-symmetric,  405 

square,  402 

symmetric,  329,  405 

transpose  of,  404 

triangular,  410 

unit,  405 


Maximum  likelihood,  136,  187,  330 

estimators,  51,   123,   129,  215,  297, 
300 
May,  M.  A.,  363 
Mean,  arithmetic,  34,  280 

confidence  limits,  113 

distribution  of,  111,  175 

experiment  on,  112 

standard  error  of,  109 

of  population,  37 
Means,  difference  of,  121,  165,  216 
Mean  squares  (A.  of  V.),  219 
Measure,  6,  7,  12,  16 
Median,  32,  50,  204 

distribution  of,  204,  205 
Mendel,  G.,  252 
Mere,  Chevalier  de,  15 
Merrington,  M.,  197,  213,  216,  250 
Mid-range,  119,  212 
Miller,  L.  H.,  257,  278 
Minimax  principle,  142,  146 
Mises,  R.  von,  27,  267 
Mode,  50 

Models  (A.  of  V.),  223,  225,  228 
Modified   exponential   regression,    349, 

353 
Molina,  E.  C.,  61,  77 
Moment  generating  function,  40 

factorial,  41 
Moments,  about  mean,  35,  38 

about  zero,  34,  37,  281 

factorial,  41 

relation  to  k-statistics,  105 
Moore,  G.  H.,  268,  278 
Moses,  L.  E.,  278 
Moshman,  J.,  94 
Most  efficient  estimators,  101 
Mosteller,  F.,  117,  122 
Multinomial  distribution,  252,  364,  401 

theorem,  400 
Multiple  correlation,  357,  360 

regression,  328,  356 
Multiplication  law  (probability),  11 
Multiplier,    undetermined,     161,     187, 

315,  400 
Multivariate  analysis,  356-376 

normal  distribution,  363,  377 

Negative  binomial  distribution,  93,  120 

Nested  model  (A.  of  V.),  232 
Newman-Keuls  procedure,  242 
Neyman,  J.,  96,  122,  150,  173,  213 

confidence  intervals,  96 

likelihood  ratio,  135,  368 

optimum  sampling,  155 

power  of  t-test,  191 

randomized  Neyman-Pearson  theorem,  139 
Non-central  distributions, 

chi-square,  136,  244,  347,  372,  397 

F,  366,  372 

t,  190,  192 
Non-normality,  effect  of,  207,  240,  245, 
251 

tests  for,  208,  256 
Non-parametric  tests,  251-273 
Normal  correlation  surface,  284 
Normal  distribution,  64,  66,  174-208 

approximation  by,  64,  66,  68,  321 

cumulants  of,  69 

standard  form,  66,  68,  70,  77,  391 
Normal  equations,  286,  329,  330,  410 
Normal  process,  375 
Normal  test,  one-sided,  136 

two-sided,  138 
Normality,    assumption    of,    174,    207, 

240,  245,  251 
Nuisance  parameters,  125 
Null  hypothesis,  131 
Null  set,  5 

Numbers,  random,  see  Random  numbers 

Ogive,  32 

Operating  characteristic,  134,  168 
Order  statistics,  204 
Orthogonal  functions,  87,  293 

matrix,  408 

polynomials,  337,  338 

residuals,  333 

transformation,  88,  393,  402 

Paired  count  (binomial),  117 

Paired  samples,  185,  260,  262 

Parabolic  regression,  335,  337 

Parameters,  37,  95 

Pareto  distribution,  50 

Paulson,  E.,  67 

Pearson,  E.  S.,  27,  77,  94,   122,   150, 
198,  201,  213,  250 
confidence  intervals,  96 
effect  of  non-normality,  208 


Pearson,  K.,  285,  380 

coefficient  of  correlation,  44,  281 

system,  90,  94 

tables,  83,  85,  94,  346 
Percentile  rank,  33 
Percentiles,  33 
Permutations,  12 
Petersburg  paradox,  26 
Plackett,  R.  L.,  355,  380 
Point  estimation,  96 
Poisson  distribution,  59,  61,  77,  83,  93, 
144 

approximation  by,  61 

confidence  limits,  98 

cumulants,  61 

parametric  test  of,  254 
Poisson  sampling  scheme,  56 
Polynomials,  fitting  of,  335 

orthogonal,  337 
Pooling    (frequency   table),   252,   256, 

275 
Population,  finite,  102,  117 

mean,  37 

relation  to  sample,  37,  95-119 
Power,  of  test,  134,  168 

chi-square  test,  254,  397 

F-test,  196,  227,  244,  245 

Kolmogorov  test,  258 

sign  test,  262 

Mann-Whitney  test,  267 

run  tests,  270 

t-test,  189 

Walsh  test,  264 

Wilcoxon  test,  264 
Precision,  211 

Predicted  value,  variance  of,  331 
Predictors,  328,  356 
Probability,  1-23 

addition  law,  7 

conditional,  10,  11,  134,  279,  397 

continuous,  19 

density,  19 

distribution,  37,  77,  78-92 

graph  paper,  70,  77 

interpretation,  2,  9 

joint,  279 

measure,  6,  7,  12 

multiplication  law,  11 

transition,  373 
Probability  transformation,  79 
Product  moment,  281 
Proportion  (of  defectives),  140,  145, 
166,  191 

Quadratic  form,  88 
Quality  control,  198,  201,  375 
Quantiles,  see  Fractiles 
Quartiles,  33 

Raff,  M.  S.,  77 

Raiffa,  H.,  150 
Rainville,  E.  D.,  380 
Random  numbers,  22,  27,  140 

sample,  13 

variable,  16,  372 

walk,  162,  372 
Randomization,  23,  139 
Randomized  decision,  139 
Randomness,  1 

tests  for,  267 
Range,  15,  17,  36,  120 

as  estimator  of  variance,  202 

critical,  242 

distribution  of,  120,  200,  203,  212 

standardized,  201,  242 

studentized,  242 
Ranges,  quotient  of,  204 
Rank,  of  matrix,  407 

of  observations,  265,  272 

of  pair  differences,  262 
Rank  correlation,  272,  308,  313 
Rao,  C.  R.,  128 
Rectangular  distribution,  40,  78,  203, 

212 
Regression,  279-322,  328-350 

and  maximum  likelihood,  330 

curvilinear,  279,  335 

exponential,  347 

for  population,  279 

linear,  279,  281 

multiple,  328,  356 

parabolic,  335,  337 

partial,  328 

plane,  356 

sample,  285 

test  for  linearity,  342 

when  variables  subject  to  error,  302 

when  X  not  random,  296 
Regression  coefficients,  279,  287,  292, 
328,  330 

confidence  limits,  294 


Rejection  error,  131 

of  extreme  observations,  199 
Residuals,  332,  358 
Resnikoff,  G.  J.,  190,  213 
Reversal  rule,  405,  407 
Rider,  P.  R.,  204,  213 
Risk  function,  143 
Robustness  (test),  174,  251 
Runs,  distribution  of,  269 

up-and-down,  268 
Rutherford,  Sir  E.,  253 

Sample,  95-119,  151-173 

likelihood  of,  48,  123,  214,  315 

paired,  185 

size,  189,  193,  197 

space,  3,  4,  17,  178,  356 
Sampling,  13,  151-173 

cluster,  152,  155 

cost  of,  155,  160 

double,  158 

inspection,  191 

numbers,  22,  27 

optimum,  155,  160 

proportionate,  154 

purposive,  152 

quota,  152 

random,  13,  95,  151 

sequential,  131,  162,  373 

stratified,  95,  151,  153 

systematic,  13,  152,  157,  171 

with  replacements,  13,  52 
Satterthwaite,  F.  E.,  228,  250 
Savage,  L.  J.,  9,  27,  51 
Saxena,  H.  C.,  327 
Scatter  diagram,  286 
Scheffe,  H.,  248,  249,  250 
Seidel,  P.  L.,  349 
Sequential  sampling,  131,  163,  165,  373 

test,  165 
Serial  correlation,  158,  171,  240 
Sets,  finite,  4 

Sheppard's  corrections,  107,  395 
Siegel,  S.,  278 
Sign  test,  260,  414 
Signed-rank  test,  262 
Significance,  level  of,  100 
Size,  of  test,  133 
Skewness,  37,  55,  61,  207,  256 

variance  of,  197 
Smirnov,  N.,  259,  260,  278 
Smith,  B.  Babington,  27 
Smith,  K.,  278 

Snedecor,  G.  W.,  210,  277,  345,  348, 
355 

F-distribution,  194,  344 

Brandt-Snedecor  formula,  317 
Sokolnikoff,  I.  S.,  393 
Spearman,  C.,  309,  314 
Sprowls,  R.  C.,  146,  150 
Square  root  method,  330,  410 

transformation,  72 
Srivastava,  A.  B.  L.,  208,  213 
Standard  deviation,  36,  38,  104 

confidence  limits,  193 

mean  and  variance  of,  209 
Standard  error,  109,  110 

of  estimate,  294 
Standardized  variate,  64 
Stationary  process,  375 
Statistic,  33 

as  estimator,  97,  130 
Stirling  approximation,  65,  76,  181,  209, 

252,  383,  401 
Stochastic  process,  372 
Straggler,  see  Extreme 
Stratification,  95,  151,  153 
Structural  relation,  285,  302 
Stuart,  A.,  94,  213 
Student  (W.  S.  Gosset),  180,  212,  380 

t-distribution,  149,  179,  182,  189, 
199,  262,  272,  294 
Sufficiency,  125,  127 
Sum  of  squares,  A.  of  V.,  217,  221,  226 

of  residuals,  333 
Systematic  sampling,  13,  152,  157,  171 

Tang,  P.  C.,  244,  245,  250,  347,  355, 

366 
Tchebycheff,  see  Chebyshev 
t-distribution,  149,  179,  182,  184,  199, 
272,  294 

non-central,  190,  192 
T-distribution,  365,  369,  371 
Test,  of  hypotheses,  131,  139 

likelihood  ratio,  162,  166 

one-tailed,  136,  186 

power  of,  134,  189,  190,  196,  262 

sequential,  162,  166 

two-tailed,  136,  186 

unbiased,  189 

uniformly  most  powerful,  135,  189 


Test  function  (Neyman-Pearson),  139 
Thompson,  C.  M.,  197,  213,  216,  250 
Thompson,  W.  R.,  199,  213 
Ties,  Mann-Whitney  test,  267 

rank  correlation,  310,  312 

Wilcoxon  test,  262 
Tippett,  L.  H.  C.,  198,  201,  207,  213 
Todhunter,  I.,  27 
Tokarska,  B.,  191,  213 
Tolerance  limits,  168 
Transformation,  angular,  70 

Fisher's,  307 

linear,  393 

logarithmic,  90,  241 

of  variates,  79,  208,  241 

orthogonal,  88,  393,  402 

probability,  79 

square  root,  72 
Transition  probability,  373 
Treatments  (A.  of  V.),  219,  222,  235, 

238 
Trend  line,  empirical,  342 
Trigamma  function,  147 
Tukey,  J.  W.,  67,  77,  103,  117,  122 
Two-by-two  table,  10,  318 

Yates's  correction,  319 

Fisher's  exact  method,  319 

Pearson's  approximation,  321 

Unbiased  estimator,  101,  105,  127 
Union  (events),  5 
Universal  set,  4 
Uspensky,  J.  V.,  27 
Utility,  9,  142 

Variable,  random,  see  Variate 
Variables  subject  to  error,  298 
Variance,  36,  280 

about  regression,  291,  294 

analysis  of,  see  Analysis  of  variance 

components  of,  225 

conditional,  156,  233,  279,  284,  398 

confidence  limits,  193 

constancy  of,  72,  90 

distribution,  89,  175 

due  to  regression,  337,  341,  358 

estimated  from  range,  202 

homogeneity  of,  214,  241 

of  estimate,  282,  358 

of  k-statistics,  108,  109 

of  linear  function,  43 

of  maximum  likelihood  estimator,  127 

of  population,  38,  193 

of  regression  coefficients,  292,   334, 
341 

standard  error  of,  109 
Variance  ratio  (see  F-distribution) 
Variates,  16 

coded,  34 

independent,  19 

uncorrelated,  44 

standardized,  64 

predicted,  328 
Variation,  coefficient  of,  90 


Vector,  329,  403 
Venn,  J.,  4 

Wald,  A.,  142,  146,  162,  173,  303 

Walker,  H.  M.,  278 

Wallis,  W.  A.,  268,  278 

Wallis's  formula,  384,  392 

Walsh,  J.  E.,  264,  278,  430 

Whitney,  D.  R.,  265,  278 

Wilcoxon,  F.,  262,  263,  264,  265,  278 

Wilks,  S.  S.,  136,  150 

Wilson,  E.  B.,  87,  94 

Woo,  T.  L.,  316,  346,  355 

Yates,  F.,  319 

z'-transformation,  307,  361 